NCEE 2009-4038 



U.S. DEPARTMENT OF EDUCATION 



Reading First Impact Study 

Final Report 




NATIONAL CENTER for 
EDUCATION EVALUATION 
and REGIONAL ASSISTANCE 



Institute of Education Sciences 





NOVEMBER 2008 



Beth C. Gamse, Project Director, Abt Associates 

Robin Tepper Jacob, Abt Associates/University of Michigan 

Megan Horst, Abt Associates 

Beth Boulay, Abt Associates 

Fatih Unlu, Abt Associates 

Laurie Bozzi 
Linda Caswell 
Chris Rodger 
W. Carter Smith 

Abt Associates 

Nancy Brigham 
Sheila Rosenblum 

Rosenblum Brigham Associates 

With the assistance of 

Howard Bloom 
Yequin He 
Corinne Herlihy 
James Kemple 
Don Laliberty 
Ken Lam 
Kenyon Maree 
Rachel McCormick 
Rebecca Unterman 
Pei Zhu 






This report was prepared for the Institute of Education Sciences under Contract No. ED-01-CO-0093/0004.
The project officer was Tracy Rimdzius in the National Center for Education Evaluation and
Regional Assistance. 

U.S. Department of Education 

Margaret Spellings 
Secretary 

Institute of Education Sciences 

Grover J. Whitehurst 
Director 

National Center for Education Evaluation and Regional Assistance 

Phoebe Cottingham 
Commissioner 



November 2008 

This report is in the public domain. Authorization to reproduce it in whole or in part is granted. While 
permission to reprint this publication is not necessary, the citation should be: Gamse, B.C., Jacob, R.T., 
Horst, M., Boulay, B., and Unlu, F. (2008). Reading First Impact Study Final Report (NCEE 2009-4038). 
Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of 
Education Sciences, U.S. Department of Education. 

IES evaluation reports present objective information on the conditions of implementation and impacts of 
the programs being evaluated. IES evaluation reports do not include conclusions or recommendations or 
views with regard to actions policymakers or practitioners should take in light of the findings in the 
reports. 



To order copies of this report, 

• Write to ED Pubs, Education Publications Center, U.S. Department of Education, P.O. Box 1398, 
Jessup, MD 20794-1398. 

• Call in your request toll free to 1-877-4ED-Pubs. If 877 service is not yet available in your area,
call 800-872-5327 (800-USA-LEARN). Those who use a telecommunications device for the deaf 
(TDD) or a teletypewriter (TTY) should call 800-437-0833. 

• Fax your request to 301-470-1244. 

• Order online at www.edpubs.org. 

This report also is available on the IES website at http://ncee.ed.gov. 



Alternate Formats 



Upon request, this report is available in alternate formats such as Braille, large print, audiotape, or 
computer diskette. For more information, please contact the Department's Alternate Format Center at
202-260-9895 or 202-205-8113.



Acknowledgements 



The Reading First Impact Study Team would like to thank the students, faculty, and staff in the study’s
participating schools and districts. Their contributions to the study (via assessments, observations, 
surveys, and more) are deeply appreciated. We are the beneficiaries of their generosity of time and spirit. 

The listed authors of this report represent only a small part of the team involved in this project. We would 
like to acknowledge the support of staff from Computer Technology Services (for the study’s data 
collection website), from DataStar (for data entry), from MDRC, from Retail Solutions at Work (and the 
hundreds of classroom observers who participated in intensive training and data collection activities), 
from Paladin Pictures (for developing training videos for classroom observations), from RMC Research 
(especially Chris Dwyer, for help on developing instruments and on training observers), from Rosenblum- 
Brigham Associates (for district site visits), from Westat (Sherry Sanborne and Alex Ratnofsky, for 
managing the student assessment, and the Student Assessment Coordinators and test administrators), and 
from Westover (Wanda Camper, LaKisha Dyson, and Pamela Wallace for helping with meeting 
logistics). 

The study has also benefited from both external and internal technical advisors, including: 



External Advisors          Internal Advisors

Josh Angrist               Steve Bell (A)
David Card                 Gordon Berlin (M)
Robert Brennan             Nancy Burstein (A)
Thomas Cook*               Fred Doolittle (M)
Jack Fletcher*             Barbara Goodson (A)
David Francis              John Hutchins (M)
Larry Hedges*              Jacob Klerman (A)
Robinson Hollister*        Marc Moss (A)
Guido Imbens               Chuck Michalopoulous (M)
Brian Jacob                Larry Orr (A)
David Lee                  Cris Price (A)
Sean Reardon               Janet Quint (M)
Tim Shanahan*              Howard Rolston (A)
Judy Singer
Jeff Smith                 (A: Abt Associates)
Faith Stevens*             (M: MDRC)
Petra Todd
Wilbert Van der Klaauw
Sharon Vaughn*





* Individuals who have served on the study’s Technical Work Group 

We also want to recognize the steady contributions of Abt-SRBI staff, including Brenda Rodriguez, Fran 
Coffey, Kay Ely, Joanne Melton, Judy Meyer, Lynn Reneau, Davyd Roskilly, Jon Schmalz, Estella Sena, 
and Judy Walker, who were instrumental in completing multiple data collections, and Eileen Fahey, 
Katheleen Linton, and Jan Nicholson for countless hours of production support. Finally, we want to 
acknowledge Diane Greene, whose wisdom helped us all. 








Disclosure of Potential Conflicts of Interests 1 

The research team for this evaluation consists of a prime contractor, Abt Associates, and two major 
subcontractors, MDRC and Westat. None of these organizations or their key staff has financial interests 
that could be affected by findings from the Reading First Impact Study. No one on the Technical Work 
Group, convened to provide advice and guidance, has financial interests that could be affected by findings 
from the evaluation. 



1 Contractors carrying out research and evaluation projects for IES frequently need to obtain expert advice and technical 
assistance from individuals and entities whose other professional work may not be entirely independent of or separable from 
the particular tasks they are carrying out for the IES contractor. Contractors endeavor not to put such individuals or entities in 
positions in which they could bias the analysis and reporting of results, and their potential conflicts of interest are disclosed. 








Contents

Acknowledgements
Disclosure of Potential Conflicts of Interests

Executive Summary
  The Reading First Program
  The Reading First Impact Study
  Research Design
  Study Sample
  Data Collection Schedule and Measures
  Average Impacts on Classroom Reading Instruction, Key Components of Scientifically Based Reading Instruction, and Student Reading Achievement
  Exploratory Analyses of Variations in Impacts and Relationships among Outcomes
  Summary

Chapter One: Overview of the Reading First Impact Study
  Reading First Program
  Conceptual Model
  Research Questions and Design
  Study Sample
  Data Collection and Outcome Measures
  Study’s Methodological Approach
  Approach to Estimating Impacts
  Statistical Significance
  Roadmap to this Report

Chapter Two: Impact Findings
  Average Impacts on Reading Instruction
  Average Impacts on Student Engagement with Print
  Average Impacts on Key Components of SBRI
  Average Impacts on Reading Achievement
  Average Impacts on Reading Comprehension
  Average Impacts on Decoding Skills for Students in Grade One in Spring 2007
  Summary





Chapter Three: Exploratory Analyses of Variations in Impacts and Relationships Among Outcomes
  Variation in Impacts
  Variation in Impacts Over Time
  Variation in Impacts on Reading Comprehension Associated with Student Exposure to Reading First Schools
  Variation in Impacts Across Sites
  Exploring the Relationship between Classroom Reading Instruction and Student Achievement
  Caveats
  Estimation Model
  Findings
  Summary
  Summary

Appendix A: State and Site Award Data

Appendix B: Methods
  Part 1: Regression Discontinuity Design
  Approach
  Part 2: Estimation Methods
  Part 3: Approach to Multiple Hypothesis Testing
  Stage 1: Creating a Parsimonious List of Outcomes and Subgroups and Prioritizing Key Outcomes
  Stage 2: Conducting Composite Tests to Qualify Specific Hypothesis Tests
  Reading Comprehension
  Classroom Instruction
  Student Engagement with Print
  Implementation of Key Components of Scientifically Based Reading Instruction (Surveys)
  Part 4: Statistical Precision
  Part 5: Handling Missing Data
  Surveys
  Classroom Observations: IPRI
  Classroom Observations: STEP
  Student Reading Achievement: SAT 10 Reading Comprehension Subtest
  Student Reading Achievement: TOSWRF





Appendix C: Measures
  Part 1: Reading Coach and Teacher Surveys
  Description of the Instruments
  Administration Procedures and Response Rates
  Composition, Scale, Internal Consistency and Scientifically Based Research Support
  Part 2: Classroom Instruction: The Instructional Practice in Reading Inventory (IPRI)
  Background
  Overview of the IPRI
  Structure of the IPRI Instrument
  Training and Inter-rater Reliability of Classroom Observers
  Data Collection
  Creation of Analytic Variables
  Field Reliability of the IPRI
  Part 3: Global Appraisal of Teaching Strategies (GATS)
  Part 4: Student Time-on-Task and Engagement with Print (STEP)
  Data Collection and Response Rates for Fall 2005, Spring 2006, Fall 2006, and Spring 2007
  Analytic Variables
  STEP Reliability
  Part 5: Reading Achievement
  Reading Comprehension
  Decoding
  Data Collection and Response Rates
  Part 6: Data Collection Instruments

Appendix D: Confidence Intervals

Appendix E: Analyses of Impacts and Trends Over Time
  Part 1: Additional Exhibits of Separate Impact Estimates for Each Follow-up Year and Pooled
  Part 2: Student Achievement Trends Over Time
  Part 3: Reading Achievement on State Tests
  Data
  Analysis
  Results

Appendix F: Analysis of Student Exposure to Reading First
  Variation in Impacts on Reading Comprehension Based on Student Exposure





Appendix G: Subgroup Analyses
  Part 1: Subgroup Impacts over Time
  Part 2: Linear Interactions between Program Impacts and Site Characteristics
  Part 3: Impact Estimates for Subgroups Defined by Site Characteristics
  Award Date
  Fall 2004 Reading Performance of the non-Reading First Schools
  Reading First Funding Per Student

References





List of Exhibits

Exhibit ES.1 Data Collection Schedule for the Reading First Impact Study
Exhibit ES.2 Description of Domains, Outcome Measures, and Data Sources Utilized in the Reading First Impact Study
Exhibit ES.3 Estimated Impacts on Reading Comprehension, Instruction, and Percentage of Students Engaged with Print: 2005, 2006, and 2007 (pooled)
Exhibit ES.4 Estimated Impacts on Key Components of Scientifically Based Reading Instruction (SBRI): Spring 2007
Exhibit ES.5 Estimated Impacts of Reading First on Decoding Skill: Grade One, Spring 2007
Exhibit 1.1 Conceptual Framework for the Reading First Program: From Legislation and Funding to Program Implementation and Impact
Exhibit 1.2 Data Collection Schedule for the Reading First Impact Study
Exhibit 1.3 Summary of RFIS Data Collection Activities and Respective Response Rates, By Grade
Exhibit 1.4 Description of Domains, Outcome Measures, and Data Sources Utilized in the Reading First Impact Study
Exhibit 2.1 Estimated Impacts on Instructional Outcomes: 2005, 2006, and 2007 (pooled)
Exhibit 2.2 Estimated Impacts On the Number of Minutes in Instruction in Each of the Five Dimensions of Reading: 2005, 2006, and 2007 (pooled)
Exhibit 2.3 Estimated Impacts on the Percentage of Students Engaged with Print: 2006 and 2007
Exhibit 2.4 Estimated Impacts on Key Components of Scientifically Based Reading Instruction (SBRI): Spring 2007
Exhibit 2.5 Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007 (Pooled)
Exhibit 2.6 Estimated Impacts of Reading First on Decoding Skill: Grade One, Spring 2007
Exhibit 3.1 Estimated Impacts on Instructional Outcomes: 2005, 2006, and 2007, and Pooled





Exhibit 3.2 Change Over Time in Program Impact on Reading Comprehension and Instruction
Exhibit 3.3 Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007, and Pooled
Exhibit 3.4 Estimated Impacts of Reading First on the Reading Comprehension of Students With Three Years of Exposure: Spring 2005-Spring 2007
Exhibit 3.5 Fixed Effect Impact Estimates for Instruction, by Site, by Grade
Exhibit 3.6 Fixed Effect Impact Estimates for Reading Comprehension, by Site, by Grade
Exhibit 3.7 F-Test of Variation in Impacts Across Sites
Exhibit 3.8 Estimated Impacts on Classroom Instruction: 2005, 2006, and 2007 (pooled), by Award Status
Exhibit 3.9 Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007 (pooled), by Award Status
Exhibit 3.10 Award Group Differences in Estimated Impacts on Reading Comprehension and Classroom Instruction: 2005, 2006, and 2007 (pooled)
Exhibit 3.11 Descriptive Statistics
Exhibit 3.12 Bivariate Correlation Coefficients between Test Scores and Predictors
Exhibit 3.13 Regression Coefficients for the Relationship between Classroom Reading Instruction and Reading Comprehension
Exhibit 3.14 Regression Coefficients Between Classroom Reading Instruction and Reading Comprehension by Treatment Status: Grade 1
Exhibit 3.15 Regression Coefficients Between Classroom Reading Instruction and Reading Comprehension by Treatment Status: Grade 2
Exhibit 3.16 Regression Coefficients Between Broadly Defined Measures of Classroom Instruction and Reading Comprehension
Exhibit 3.17 Regression Coefficients Between Broadly Defined Measures of Classroom Instruction and Reading Comprehension by Grade and Treatment Status





Exhibit 3.18 Regression Coefficients Between All Predictors and Reading Comprehension
Exhibit 3.19 Regression Coefficients Between All Predictors and Reading Comprehension by Treatment Status
Exhibit A.1 Award Date by Site in Order of Date when Reading First Funds Were First Made Available for Implementation
Exhibit B.1 Regression Discontinuity Analysis for a Hypothetical School District
Exhibit B.2 Numbers, Ratings, and Cut-points for Selection of Reading First and Reading First Impact Study Schools, by Site (Initial Sample for 17 Sites, Excluding Random Assignment Site)
Exhibit B.3 RFIS Sample Selection: From Regression Discontinuity Design Target Sample to Analytic Sample
Exhibit B.4 Observed Differences in Baseline Characteristics of Schools in the Study Sample: 2002-2003
Exhibit B.5 Estimated Residual Differences in Baseline Characteristics of Schools in the Study Sample: 2002-2003
Exhibit B.6 Outcome Tiers for the Reading First Impact Analysis
Exhibit B.7 Summary of Impacts and Results of Composite Tests
Exhibit B.8 Minimal Detectable Effects for Full Sample Impact Estimates
Exhibit C.1 Survey Data Collection: School, Reading Coach, and Teacher Sample Information
Exhibit C.2 Composition, Metric, Specifications, and Internal Consistency of Survey Outcomes
Exhibit C.3 Reading First Legislative Support and Guidance for Survey Outcomes
Exhibit C.4 Examples of Instruction in the Five Dimensions of Reading Instruction
Exhibit C.5 IPRI Data Collection: School, Classroom, and Observation Sample Information
Exhibit C.6 Composite of Classroom Constructs





Exhibit C.7 Unconditional HLM Models to Estimate Pseudo-ICCs (p1) and True Variance Across Classrooms (p2)
Exhibit C.8 Average Correlation Between Paired Observers’ Codes Across Classrooms
Exhibit C.9 Main and Interaction Effects in a (r: c)*i Design
Exhibit C.10 Calculating Variance Components for a (r: c)*i Design
Exhibit C.11 Generalizability Coefficients Estimated from the Co-Observation Data
Exhibit C.12 Prototypical STEP Observation in One Classroom
Exhibit C.13 STEP Data Collection: School, Classroom, and Observation Sample Information
Exhibit C.14 Percent Correct by Code and Overall for STEP Reliability Tape, Fall 2006
Exhibit C.15 Features of SAT 10: Reading/Listening Comprehension for Spring Administration
Exhibit C.16 Student Assessment Data Collection: Sample School and Student Information
Exhibit C.17 Reading Coach Survey
Exhibit C.18 Teacher Survey
Exhibit C.19 Instructional Practice in Reading Inventory (IPRI)
Exhibit C.20 Global Appraisal of Teaching Strategies
Exhibit C.21 Student Time-on-Task and Engagement with Print (STEP) Instrument
Exhibit D.1 Confidence Intervals for Estimated Impacts on Reading Comprehension and Decoding Skills: Spring 2005, 2006, and 2007
Exhibit D.2 Confidence Intervals for Estimated Impacts on Instructional Outcomes: Spring 2005, Fall 2005, Spring 2006, Fall 2006 and Spring 2007
Exhibit D.3 Confidence Intervals for Estimated Impacts on Time Spent in Instruction in the Five Dimensions: Spring 2005, Fall 2005, Spring 2006, Fall 2006, and Spring 2007





Exhibit D.4 Confidence Intervals for Estimated Impacts on Student Engagement with Print: Fall 2005, Spring 2006, Fall 2006, and Spring 2007
Exhibit E.1 Estimated Impacts on the Number of Minutes of Instruction in Each of Five Dimensions of Reading in First Grade: 2005, 2006, and 2007, and Pooled
Exhibit E.2 Estimated Impacts On the Number of Minutes in Instruction in Each of Five Dimensions of Reading in Second Grade: 2005, 2006, and 2007, and Pooled
Exhibit E.3 Estimated Impacts on the Percentage of Students Engaged with Print: 2006 and 2007, and Pooled
Exhibit E.4 Reading Comprehension Means: Spring 2005, Spring 2006, and Spring 2007
Exhibit E.5 Reading Comprehension Means: Spring 2005, Spring 2006, and Spring 2007
Exhibit E.6 Estimated Impacts of Reading First on Grade 3 State Reading/ELA Tests and SAT 10 Reading Comprehension Subtest: 2006
Exhibit F.1 Percentage of Third Graders in Same Treatment Status for Three Years by Site and Treatment Status
Exhibit F.2 Estimated Regression Adjusted and Unadjusted Impacts of Reading First on the Percent of Students With Three Years of Exposure to the Same Treatment Status, Spring 2005-Spring 2007
Exhibit F.3 Estimated Impacts of Reading First on the Reading Comprehension of Students With Three Years of Exposure: Spring 2005-Spring 2007
Exhibit G.1 Estimated Impacts on Reading Comprehension and Minutes in the Five Dimensions, by Implementation Year, Calendar Year, and Award Status
Exhibit G.2 Change Over Time in Program Impact on Reading Comprehension and Instruction, By Award Status
Exhibit G.3 Estimated Impacts on Classroom Instruction: 2005, 2006, and 2007 (pooled), by Award Status
Exhibit G.4 Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007 (pooled), by Award Status





Exhibit G.5 Award Group Differences in Estimated Impacts on Reading Comprehension and Classroom Instruction: 2005, 2006, and 2007 (pooled)
Exhibit G.6 Change in Impact Associated with One Unit of Change In Continuous Dimensions
Exhibit G.7 Characteristics of Early and Late Award Sites
Exhibit G.8 Estimated Impacts on Reading Comprehension, by Award Status
Exhibit G.9 Estimated Impacts on Reading Instruction, by Award Status
Exhibit G.10 Estimated Impacts on Reading Comprehension, by Fall 2004 Reading Performance of the non-Reading First Schools
Exhibit G.11 Estimated Impacts on Reading Instruction, by Fall 2004 Reading Performance of the Non-Reading First Schools
Exhibit G.12 Estimated Impacts on Reading Comprehension, by Reading First Funds Per Student
Exhibit G.13 Estimated Impacts on Reading Instruction, by Reading First Funds Per Student





Executive Summary 

This report presents findings from the third and final year of the Reading First Impact Study (RFIS), a 
congressionally mandated evaluation of the federal government’s $1.0 billion-per-year initiative to help 
all children read at or above grade level by the end of third grade. The No Child Left Behind Act of 2001 
(PL 107-110, Title I, Part B, Subpart 1) established Reading First (RF) and mandated its evaluation. This 
evaluation is being conducted by Abt Associates and MDRC with collaboration from RMC Research, 
Rosenblum-Brigham Associates, Westat, Computer Technology Services, DataStar, Field Marketing 
Incorporated, and Westover Consulting, under the oversight of the U.S. Department of Education,
Institute of Education Sciences (IES).

This report examines the impact of Reading First funding on 248 schools in 13 states and includes 17 
school districts and one statewide program for a total of 18 sites. The study includes data from three 
school years: 2004-05, 2005-06 and 2006-07. 

The Reading First Impact Study was commissioned to address the following questions: 

1) What is the impact of Reading First on student reading achievement? 

2) What is the impact of Reading First on classroom instruction? 

3) What is the relationship between the degree of implementation of scientifically based reading 
instruction and student reading achievement? 

The primary measure of student reading achievement was the Reading Comprehension subtest from the 
Stanford Achievement Test — 10 (SAT 10), given to students in grades one, two, and three. A secondary 
measure of student reading achievement in decoding was given to students in first grade. The measure of 
classroom reading instruction was derived from direct observations of reading instruction, and measures 
of program implementation were derived from surveys of educational personnel. Findings related to the 
first two questions are based on results pooled across the study’s three years of data collection (2004-05, 
2005-06, and 2006-07) for classroom instruction and reading comprehension, results from first grade 
students in one school year (spring 2007) for decoding, and aspects of program implementation from 
spring 2007 surveys. Key findings are as follows: 

• Reading First produced a positive and statistically significant impact on amount of 
instructional time spent on the five essential components of reading instruction promoted by 
the program (phonemic awareness, phonics, vocabulary, fluency, and comprehension) in 
grades one and two. The impact was equivalent to an effect size of 0.33 standard deviations 
in grade one and 0.46 standard deviations in grade two. 

• Reading First produced positive and statistically significant impacts on multiple practices that 
are promoted by the program, including professional development in scientifically based 
reading instruction (SBRI), support from full-time reading coaches, amount of reading 
instruction, and supports available for struggling readers. 

• Reading First did not produce a statistically significant impact on student reading 
comprehension test scores in grades one, two or three. 








• Reading First produced a positive and statistically significant impact on decoding among first 
grade students tested in one school year (spring 2007). The impact was equivalent to an effect 
size of 0.17 standard deviations. 
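
The effect sizes quoted in these findings express an impact in standard deviation units. The conversion can be sketched as follows, assuming the impact estimate is simply divided by a standard deviation of the outcome measure. The function name and the numbers are hypothetical, and this section does not state which standard deviation the study uses as the denominator.

```python
def effect_size(impact, outcome_sd):
    """Convert an impact in raw outcome units to standard deviation units.

    Assumption of this sketch: the effect size is the impact estimate
    divided by a standard deviation of the outcome measure.
    """
    if outcome_sd <= 0:
        raise ValueError("outcome_sd must be positive")
    return impact / outcome_sd


# Hypothetical example: an impact of 6 units on an outcome whose
# standard deviation is 20 units.
example = effect_size(impact=6.0, outcome_sd=20.0)
print(f"effect size: {example:.2f} standard deviations")
```

Reporting impacts this way allows outcomes measured on different scales, such as minutes of instruction and test scores, to be compared.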

Results are also presented from exploratory analyses that examine some hypotheses about factors that 
might account for the observed patterns of impacts. These analyses are considered exploratory because 
the study was not designed to provide a rigorous test of these hypotheses, and therefore the results must 
be considered as suggestive. Across different potential predictors of student outcomes, these exploratory 
analyses are based on different subgroups of students, schools, grade levels, and/or years of data 
collection. Key findings from these exploratory analyses are as follows: 

• There was no consistent pattern of effects over time in the impact estimates for reading 
instruction in grade one or in reading comprehension in any grade. There appeared to be a 
systematic decline in reading instruction impacts in grade two over time. 

• There was no relationship between reading comprehension and the number of years a student 
was exposed to RF. 

• There is no statistically significant site-to-site variation in impacts, either by grade or overall, 
for classroom reading instruction or student reading comprehension. 

• There is a positive association between time spent on the five essential components of 
reading instruction promoted by the program and reading comprehension measured by the 
SAT 10, but these findings are sensitive to both model specification and the sample used to
estimate the relationship. 

The Reading First Program 

Reading First promotes instructional practices that have been validated by scientific research (No Child 
Left Behind Act, 2001). The legislation explicitly defines scientifically based reading research and 
outlines the specific activities state, district, and school grantees are to carry out based upon such research 
(No Child Left Behind Act, 2001). The Guidance for the Reading First Program provides further detail to 
states about the application of research-based approaches in reading (U.S. Department of Education, 
2002). Reading First funding can be used for: 

• Reading curricula and materials that focus on the five essential components of reading 
instruction as defined in the Reading First legislation: 1) phonemic awareness, 2) phonics, 3) 
vocabulary, 4) fluency, and 5) comprehension; 

• Professional development and coaching for teachers on how to use scientifically based 
reading practices and how to work with struggling readers; 

• Diagnosis and prevention of early reading difficulties through student screening, 
interventions for struggling readers, and monitoring of student progress. 

Reading First is an ambitious federal program, yet it is also a funding stream that combines local 
flexibility and national commonalities. The commonalities are reflected in the guidelines to states and 
districts and schools about allowable uses of resources. The flexibility is reflected in two ways: one, states 
(and districts) could allocate resources to various categories within target ranges rather than on a strictly 
formulaic basis, and two, states could make local decisions about the specific choices within given 
categories (e.g., which materials, reading programs, assessments, professional development providers, 








etc.). The activities, programs, and resources that were likely to be implemented across states and districts 
would therefore reflect both national priorities and local interpretations. 

Reading First grants were made available to states between July 2002 and September 2003. By April 
2007, states had awarded subgrants to 1,809 school districts, which had provided funds to 5,880 schools. 2 
Districts and schools with the greatest demonstrated need, in terms of student reading proficiency and 
poverty status, were intended to have the highest funding priority (U.S. Department of Education, 2002). 
States could reserve up to 20 percent of their Reading First funds to support staff development, technical 
assistance to districts and schools, and planning, administration and reporting. According to the program 
guidance, this funding provided “States with the resources and opportunity . . . to improve instruction
beyond the specific districts and schools that receive Reading First subgrants.” (U.S. Department of 
Education, 2002). Districts could reserve up to 3.5 percent of their Reading First funds for planning and 
administration (No Child Left Behind Act, 2001). For the purposes of this study, Reading First is defined 
as the receipt of Reading First funding at the school level. 

The Reading First Impact Study 

Research Design 

The Reading First Impact Study uses a regression discontinuity design that capitalizes on the systematic 
processes some school districts used to allocate Reading First funds once their states had received RF 
grants. 3 A regression discontinuity design is the strongest quasi-experimental method available for
estimating program impacts. Under certain conditions, all of which are met by the present
study, this method can produce unbiased impact estimates. Within each district or site:

1) Schools eligible for Reading First grants were rank-ordered for funding based on a 
quantitative rating, such as an indicator of past student reading performance or poverty; 4 

2) A cut-point in the rank-ordered priority list separated schools that did or did not receive 
Reading First grants, and this cut-point was set without knowing which schools would then 
receive funding; and 

3) Funding decisions were based only on whether a school’s rating was above or below its local 
cut-point; nothing superseded these decisions. 
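
The three conditions above amount to a deterministic assignment rule. A minimal sketch of such a rule follows. The school names, ratings, and cut-point are hypothetical and do not come from the study, and the sketch assumes that a higher rating means higher funding priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    name: str      # hypothetical identifier
    rating: float  # site-specific quantitative rating (e.g., a poverty indicator)


def assign_funding(schools, cut_point):
    """Return {school name: funded?} using only the rating and the cut-point.

    Assumption of this sketch: a rating at or above the cut-point means the
    school is funded. A site that ranks in the opposite direction would flip
    the comparison.
    """
    ranked = sorted(schools, key=lambda s: s.rating, reverse=True)
    return {s.name: s.rating >= cut_point for s in ranked}


# Hypothetical site with five eligible schools and a cut-point of 60.
site = [
    School("A", 82.0),
    School("B", 71.5),
    School("C", 60.0),
    School("D", 58.9),
    School("E", 40.2),
]
funded = assign_funding(site, cut_point=60.0)
print(funded)
```

Because the rule depends on nothing but the rating, any systematic difference between funded and unfunded schools is a function of the rating itself, which is what the analysis then controls for.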

Also, assuming that the shape of the relationship between schools’ ratings and outcomes is correctly 
modeled, once the above conditions have been met, there should be no systematic differences between 
eligible schools that did and did not receive Reading First grants (Reading First and non-Reading First 
schools respectively), except for the characteristics associated with the school ratings used to determine 
funding decisions. Controlling for differences in schools’ ratings allows one to control statistically for all 
systematic pre-existing differences between the two groups. One then can estimate the impact of Reading 
First by comparing the outcomes for Reading First schools and non-Reading First schools in the study 



2 Data were obtained from the SEDL website (www.sedl.org/readingfirst).

3 Appendix A indicates when study sites first received their Reading First grants. 

4 Each study site could (and did) use different metrics to rate or rank schools; it is not necessary for all study sites to use the 
same metric. 








sample, controlling for differences in their ratings. Non-Reading First schools in a regression 
discontinuity analysis thereby play the same role as do control schools in a randomized experiment — it is 
their regression-adjusted outcomes that represent the best indications of what outcomes would have been 
for the treatment group (in this instance, Reading First schools) in the absence of the program being 
evaluated. 
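The comparison described above can be sketched in a few lines. This is a minimal illustration only, assuming a single hypothetical site, a linear relationship between ratings and outcomes, and funding for schools rated at or above the cut-point; the data, the function names (`solve_linear_system`, `rd_impact`), and the model are assumptions for illustration and are not the study's estimation model.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rd_impact(ratings, outcomes, cut_point):
    """Least-squares fit of outcome on [1, funded, rating - cut_point].

    Returns the coefficient on the funded indicator, i.e. the estimated
    impact after controlling for the rating (assumed linear here).
    """
    rows = []
    for rating in ratings:
        funded = 1.0 if rating >= cut_point else 0.0  # assumption: funded at or above the cut
        rows.append([1.0, funded, rating - cut_point])
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(k)]
    return solve_linear_system(xtx, xty)[1]


# Hypothetical site: ratings 40..60, cut-point 50, outcome rises 0.5 points per
# rating point, and the program adds 3 points for funded schools.
ratings = [float(r) for r in range(40, 61)]
cut_point = 50.0
outcomes = [500.0 + 0.5 * (r - cut_point) + (3.0 if r >= cut_point else 0.0) for r in ratings]

estimated_impact = rd_impact(ratings, outcomes, cut_point)

# A raw comparison of group means mixes the program effect with rating differences.
funded = [y for r, y in zip(ratings, outcomes) if r >= cut_point]
unfunded = [y for r, y in zip(ratings, outcomes) if r < cut_point]
naive_gap = sum(funded) / len(funded) - sum(unfunded) / len(unfunded)
```

In this constructed example the raw difference in means overstates the program effect because funded schools also have higher ratings; controlling for the rating recovers the 3-point effect built into the data.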



Study Sample 

The study sample was selected purposively to meet the requirements of the regression discontinuity 
design by selecting a sample of sites that had used a systematic rating or ranking process to select their 
Reading First school grantees. Within these sites, the selection of schools focused on schools as close to 
the site-specific cut-points as possible in order to obtain schools that were as comparable as possible in 
the treatment and comparison groups. 

The study sample includes 18 study sites: 17 school districts and one state-wide program. Sixteen districts 
and one state-wide program were selected from among 28 districts and one state-wide program that had 
demonstrably met the three criteria listed above. One other school district agreed to randomly assign some 
of its eligible schools to Reading First or a control group. The final selection reflected wide variation in 
district characteristics and provided enough schools to meet the study’s sample size requirements. The 
regression discontinuity sites provide 238 schools for the analysis, and the randomized experimental site 
provides 10 schools. Half the schools at each site are Reading First schools and half are non-Reading First 
schools. In three sites, the study sample includes all of the site's RF schools; in the remaining 15 sites, 
the study sample includes some, but not all, of the site's RF schools. 

At the same time, the study deliberately endeavored to obtain a sample that was geographically diverse 
and as similar as possible to the population of all RF schools. The final study sample comprises 248 schools; its 125 
Reading First schools represent 44 percent of the Reading First schools in their respective 
sites (at the time the study selected its sample in 2004). The study’s sample of RF schools is large, is quite 
similar to the population of all RF schools, is geographically diverse, and represents states (and districts) 
that received their RF grants across the range of RF state award dates. The average Year 1 grant for RF 
schools in the study sample ranged from about $81,790 to $708,240, with a mean of $188,782. This 
translates to an average of $601 per RF student. For more detailed information about the selection process 
and the study sample, see the study’s Interim Report (Gamse, Bloom, Kemple & Jacob, 2008). 

Data Collection Schedule and Measures 

Exhibit ES. 1 summarizes the study’s three-year, multi-source data collection plan. The present report is 
based on data for school years 2004-05, 2005-06, and 2006-07. Data collection included student 
assessments in reading comprehension and decoding, and classroom observations of teachers’ 
instructional practices in reading, teachers’ instructional organization and order, and students’ 
engagement with print. Data were also collected through surveys of teachers, reading coaches, and 
principals, and interviews of district personnel. 








Exhibit ES.1: Data Collection Schedule for the Reading First Impact Study

                                                          2004-2005         2005-2006         2006-2007
Data Collection Elements                                  Fall     Spring   Fall     Spring   Fall     Spring
Student Testing                                           ✓        ✓                 ✓                 ✓
  Stanford Achievement Test, 10th Edition (SAT 10)                                   ✓                 ✓
  Test of Silent Word Reading Fluency (TOSWRF)                                                         ✓
Classroom Observations                                             ✓        ✓        ✓        ✓        ✓
  Instructional Practice in Reading Inventory (IPRI)               ✓        ✓        ✓        ✓        ✓
  Student Time-on-Task and Engagement with Print (STEP)                     ✓        ✓        ✓        ✓
  Global Appraisal of Teaching Strategies (GATS)                            ✓        ✓        ✓        ✓
Teacher, Principal, Reading Coach Surveys                          ✓                                   ✓
District Staff Interviews                                                                              ✓



Exhibit ES.2 lists the principal domains for the study, the outcome measures within each domain, and the 
data sources for each measure. These include: 

Student reading performance, assessed with the reading comprehension subtest of the Stanford 
Achievement Test, 10th Edition (SAT 10, Harcourt Assessment, Inc., 2004). The SAT 10 was 
administered to students in grades one, two and three during fall 2004, spring 2005, spring 2006, and 
spring 2007, with an average completion rate of 83 percent across all administrations. In the spring of 
2007 only, first grade students were assessed with the Test of Silent Word Reading Fluency (TOSWRF, 
Mather et al., 2004), a measure designed to assess students’ ability to decode words from among strings 
of letters. The average completion rate was 86 percent. Three outcome measures of student reading 
performance were created from SAT 10 and TOSWRF data. 

Classroom reading instruction, assessed in first-grade and second-grade reading classes through an 
observation system developed by the study team called the Instructional Practice in Reading Inventory 
(IPRI). Observations were conducted during scheduled reading blocks in each sampled classroom on two 
consecutive days during each wave of data collection: spring 2005, fall 2005 and spring 2006, and fall 
2006 and spring 2007. The average completion rate was 98 percent across all years. The IPRI, which is 
designed to record instructional behaviors in a series of three-minute intervals, can be used for 
observations of varying lengths, reflecting the fact that schools’ defined reading blocks can and do vary. 
Most reading blocks are 90 minutes or more. Eight outcome measures of classroom reading instruction 
were created from IPRI data to represent the components of reading instruction emphasized by the 
Reading First legislation. 5 Six of these measures are reported in terms of the amount of time spent on the 



5 For ease of explication, the measures created from IPRI data are referred to as the five dimensions of reading instruction (or 
“the five dimensions”) throughout the report. References to the programmatic emphases as required by legislation are labeled 
as the five essential components of reading instruction. 










Exhibit ES.2: Description of Domains, Outcome Measures, and Data Sources Utilized in the 
Reading First Impact Study 

Domain: Student reading performance 

  Outcome measure: Mean scaled scores for 1st, 2nd, and 3rd grade students, represented 
  as a continuous measure of student reading comprehension. Because scaled scores are continuous 
  across grade levels, values for all three grade levels can be shown on a single set of axes. 
  Source: Stanford Achievement Test, 10th Edition (SAT 10) 

  Outcome measure: Percentage of 1st, 2nd, and 3rd grade students at or above grade level, 
  based upon established test norms that correspond to grade level performance, by grade and month. 
  The on or above grade level performance percentages were based on the start of the school year, 
  date of the test and the scaled score, as well as the related grade equivalent. 
  Source: Stanford Achievement Test, 10th Edition (SAT 10) 

  Outcome measure: Mean standard scores for 1st grade students, represented as a 
  continuous measure of first grade students’ decoding skill. 
  Source: Test of Silent Word Reading Fluency 

Domain: Classroom reading instruction 

  Outcome measure: Minutes of instruction in phonemic awareness, or how much instructional 
  time 1st and 2nd grade teachers spent on phonemic awareness. 
  Source: RFIS Instructional Practice in Reading Inventory 

  Outcome measure: Minutes of instruction in phonics, or how much instructional time 1st 
  and 2nd grade teachers spent on phonics. 
  Source: RFIS IPRI 

  Outcome measure: Minutes of instruction in fluency building, or how much instructional 
  time 1st and 2nd grade teachers spent on fluency building. 
  Source: RFIS IPRI 

  Outcome measure: Minutes of instruction in vocabulary development, or how much 
  instructional time 1st and 2nd grade teachers spent on vocabulary development. 
  Source: RFIS IPRI 

  Outcome measure: Minutes of instruction in comprehension, or how much instructional 
  time 1st and 2nd grade teachers spent on comprehension of connected text. 
  Source: RFIS IPRI 

  Outcome measure: Minutes of instruction in all five dimensions combined, or how much 
  instructional time 1st and 2nd grade teachers spent on all five dimensions combined. 
  Source: RFIS IPRI 

  Outcome measure: Proportion of each observation with highly explicit instruction, or the 
  proportion of time spent within the five dimensions when teachers used highly explicit instruction 
  (e.g., instruction included teacher modeling, clear explanations, and the use of examples). 
  Source: RFIS IPRI 

  Outcome measure: Proportion of each observation with high quality student practice, or 
  the proportion of time spent within the five dimensions when teachers provided students with high 
  quality student practice opportunities (e.g., teachers asked students to practice such word 
  learning strategies as context, word structure, and meanings). 
  Source: RFIS IPRI 

Domain: Student engagement with print 

  Outcome measure: Percentage of 1st and 2nd grade students engaged with print, 
  represented as the per-classroom average of the percentage of students engaged with print across 
  three sweeps in each classroom during observed reading instruction. 
  Source: RFIS Student Time-on-Task and Engagement with Print (STEP) 

Domain: Professional development in scientifically based reading instruction 

  Outcome measure: Amount of PD in reading received by teachers, or teachers’ self-reported 
  number of hours of professional development in reading during 2006-07. 
  Source: RFIS Teacher Survey 

  Outcome measure: Teacher receipt of PD in the five essential components of reading 
  instruction, or the number of essential components teachers reported were covered in professional 
  development they received during 2006-07. 
  Source: RFIS Teacher Survey 

  Outcome measure: Teacher receipt of coaching, or whether or not a teacher reported 
  receiving coaching or mentoring from a reading coach in reading programs, materials, or strategies 
  in 2006-07. 
  Source: RFIS Teacher Survey 

  Outcome measure: Amount of time dedicated to serving as K-3 reading coach, or reading 
  coaches’ self-reported percentage of time spent as the K-3 reading coach for their school in 
  2006-07. 
  Source: RFIS Reading Coach Survey 

Domain: Amount of reading instruction 

  Outcome measure: Minutes of reading instruction per day, or teachers’ reported average 
  amount of time devoted to reading instruction per day over the prior week. 
  Source: RFIS Teacher Survey 

Domain: Supports for struggling readers 

  Outcome measure: Availability of differentiated instructional materials for struggling 
  readers, or whether or not schools reported that specialized instructional materials beyond the 
  core reading program were available for struggling readers. 
  Source: RFIS Reading Coach and Principal Surveys 

  Outcome measure: Provision of extra classroom practice for struggling readers, or the 
  number of dimensions in which teachers reported providing extra practice opportunities for 
  struggling students in the past month. 
  Source: RFIS Teacher Survey 

Domain: Use of assessments 

  Outcome measure: Use of assessments to inform classroom practice, or the number of 
  instructional purposes for which teachers reported using assessment results. 
  Source: RFIS Teacher Survey 



various dimensions of instruction. Two of these measures are reported in terms of the proportion of the 
intervals within each observation. 
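To illustrate how interval-coded observations can yield both kinds of measure (minutes and proportions), here is a simplified sketch. The three-minute interval length comes from the text; everything else is an assumption made for illustration, including the record layout, the rule that each coded interval contributes a full three minutes, and the function name `summarize_observation`. The study's actual coding rules are not described in this section.

```python
INTERVAL_MINUTES = 3  # the IPRI records behaviors in three-minute intervals


def summarize_observation(intervals):
    """Summarize one observation.

    intervals: list of dicts with
      'dimensions'      - set of dimension names coded in that interval (may be empty)
      'highly_explicit' - True if highly explicit instruction was coded
    Assumption: an interval with any coded dimension counts as three minutes.
    """
    coded = [iv for iv in intervals if iv["dimensions"]]
    minutes_by_dimension = {}
    for iv in coded:
        for dimension in iv["dimensions"]:
            minutes_by_dimension[dimension] = (
                minutes_by_dimension.get(dimension, 0) + INTERVAL_MINUTES
            )
    total_minutes = INTERVAL_MINUTES * len(coded)
    explicit = sum(1 for iv in coded if iv["highly_explicit"])
    explicit_share = explicit / len(coded) if coded else 0.0
    return minutes_by_dimension, total_minutes, explicit_share


# Hypothetical 15-minute observation (five intervals).
observation = [
    {"dimensions": {"phonics"}, "highly_explicit": True},
    {"dimensions": {"phonics"}, "highly_explicit": False},
    {"dimensions": {"comprehension"}, "highly_explicit": True},
    {"dimensions": set(), "highly_explicit": False},  # no dimension coded
    {"dimensions": {"vocabulary"}, "highly_explicit": False},
]

by_dimension, total_minutes, explicit_share = summarize_observation(observation)
```

Here four of the five intervals fall within the dimensions of interest, giving 12 minutes in total, and half of those intervals include highly explicit instruction.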

Student engagement with print. Beginning in fall 2005, the study conducted classroom observations 
using the Student Time-on-Task and Engagement with Print (STEP) instrument to measure the percentage 
of students engaged in academic work who are reading or writing print. The STEP observation was 
completed by recording a time-sampled “snapshot” of student engagement three times in each observed 
classroom, for a total of three such “sweeps” during each STEP observation. The STEP was used to 
observe classrooms in fall 2005, spring 2006, fall 2006, and spring 2007, with an average completion rate 
of 98 percent across all years. One outcome measure was created using STEP data. 
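The STEP outcome is described as a per-classroom average, across three sweeps, of the percentage of students engaged with print. A minimal sketch of that calculation follows; the counts are hypothetical, and the choice of denominator (students counted in each sweep) is an assumption, since this section does not spell it out.

```python
def engagement_with_print(sweeps):
    """Average, across sweeps, of the percentage of students engaged with print.

    sweeps: list of (students_engaged_with_print, students_counted) pairs.
    Assumption: each sweep's percentage uses the students counted in that sweep.
    """
    percentages = [100.0 * engaged / counted for engaged, counted in sweeps if counted > 0]
    if not percentages:
        raise ValueError("no sweep with students counted")
    return sum(percentages) / len(percentages)


# Hypothetical classroom observed in three sweeps.
classroom_sweeps = [(10, 20), (12, 20), (8, 16)]
classroom_percentage = engagement_with_print(classroom_sweeps)
```

Averaging the per-sweep percentages (50, 60, and 50 percent here) weights each sweep equally, regardless of how many students were counted in it.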

Professional development in scientifically based reading instruction, amount of reading instruction, 
supports for struggling readers, and use of assessments. Within these four domains, eight outcome 
measures were created based on data from surveys of principals, reading coaches, and teachers about 
school and classroom resources. The eight outcome measures represent aspects of scientifically based 
reading instruction promoted in the Reading First legislation and guidance. Surveys were fielded in spring 
2005 and again in spring 2007 with an average completion rate across all respondents of 73 percent in 
spring 2005 and 86 percent in spring 2007. This final report includes findings from 2007 surveys only. 









Additional data were collected by the study team in order to create measures used in correlational 
analyses. These data include: 

The Global Appraisal of Teaching Strategies (GATS), a 12-item checklist designed to measure teachers’ 
instructional strategies related to overall instructional organization and order, is adapted from The 
Checklist of Teacher Competencies (Foorman and Schatschneider, 2003). Unlike the IPRI, which focuses 
on discrete teacher behaviors, the GATS was designed to capture global classroom management and 
environmental factors. Items covered topics such as the teacher’s organization of materials, lesson 
delivery, responsiveness to students, and behavior management. The GATS was completed by the 
classroom observer immediately after each IPRI observation, meaning that each sampled classroom was 
rated on the GATS twice in the fall and twice in the spring in both the 2005-2006 school year and the 
2006-2007 school year. The GATS was fielded in fall 2005, spring 2006, fall 2006, and spring 2007, with 
an average completion rate of over 99 percent. A single measure from the GATS data was created for use 
in correlational analyses. 

Average Impacts on Classroom Reading Instruction, Key Components 
of Scientifically Based Reading Instruction, and Student Reading 
Achievement 



Exhibit ES.3 reports average impacts on classroom reading instruction and student reading 
comprehension pooled across school years 2004-05, 2005-06, and 2006-07. 6 Exhibit ES.4 reports 
average impacts on key components of scientifically based reading instruction from spring 2007. Exhibit 
ES.5 reports the average impact on first graders’ decoding skills from spring 2007. Impacts were 
estimated for each study site and averaged across sites in proportion to their number of Reading First 
schools in the sample. Average impacts thus represent the typical study school. On average: 

• Reading First had a statistically significant impact on the total time that teachers spent on the 
five essential components of reading instruction promoted by the program in grades one and 
two. 

• Reading First had a statistically significant impact on the use of highly explicit instruction in 
grades one and two and on the amount of high quality student practice in grade two. Its 
estimated impact on high quality student practice for grade one was not statistically 
significant. 

• Reading First had no statistically significant impacts on student engagement with print. 

• Reading First had a statistically significant impact on the amount of professional 
development in reading teachers reported receiving; teachers in RF schools reported receiving 
25.8 hours of professional development compared to what would have been expected without 
Reading First (13.7 hours). The program also had a statistically significant impact on 
teachers’ self-reported receipt of professional development in the five essential components 
of reading instruction; teachers in RF schools reported receiving professional development on 
an average of 4.3 of 5 components, compared to what would have been expected without 
Reading First (3.7 components). 



6 Except for student engagement with print (STEP), which is pooled across the 2005-06 and 2006-07 school years only. 








• A statistically significantly greater proportion of teachers in RF schools reported receiving 
coaching from a reading coach than would be expected without Reading First (a difference of 
20 percentage points). The 
program also had a statistically significant impact on the amount of time reading coaches 
reported spending in their role as the school’s reading coach; coaches in RF schools reported 
spending 91.1 percent of their time in this role, 33.5 percentage points more than would be 
expected without Reading First (57.6 percent). 

• Reading First had a statistically significant impact on the amount of time teachers reported 
spending on reading instruction per day. Teachers in RF schools reported an average of 105.7 
minutes per day, 18.5 minutes more than the 87.2 minutes that would be expected without 
Reading First. 

• Reading First had a statistically significant impact on teachers’ provision of extra classroom 
practice in the essential components of reading instruction in the past month; the impact was 
0.2 components. 

• There were no statistically significant impacts of Reading First on the availability of 
differentiated instructional materials for struggling readers or on teachers’ reported use of 
assessments to inform classroom practice for grouping, diagnostic, and progress monitoring 
purposes. 

• Reading First had no statistically significant impact on students’ reading comprehension 
scaled scores or the percentages of students whose reading comprehension scores were at or 
above grade level in grades one, two or three. The average first, second, and third grade 
student in Reading First schools was reading at the 44 th , 39 th , and 39 th percentile respectively 
on the end-of-the-year assessment (on average over the three years of data collection). 

• Reading First had a positive and statistically significant impact on average scores on the 
TOSWRF, a measure of decoding skill, equivalent to 2.5 standard score points, or an effect 
size of 0.17 standard deviations (See Exhibit ES.5). Because the test of students’ decoding 
skills was only administered in a single grade and a single year, it is not possible to provide 
an estimate of Reading First’s overall impact on decoding skills across multiple grades and 
across all three years of data collection, as was done for reading comprehension. 
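The averages listed above pool site-level estimates in proportion to each site's number of Reading First schools in the sample. A minimal sketch of that weighting, using hypothetical site impacts and school counts (not study data) and an illustrative function name:

```python
def weighted_average_impact(site_impacts):
    """Average site-level impacts, weighting each site by its number of RF schools.

    site_impacts: list of (estimated_impact, number_of_rf_schools) pairs.
    """
    total_schools = sum(n for _, n in site_impacts)
    return sum(impact * n for impact, n in site_impacts) / total_schools


# Hypothetical sites: (impact estimate, RF schools in the sample).
sites = [(4.0, 10), (1.0, 5), (-2.0, 5)]

pooled_impact = weighted_average_impact(sites)
unweighted_impact = sum(impact for impact, _ in sites) / len(sites)
```

With these made-up numbers the school-weighted average (1.75) differs from the simple average of site estimates (1.0), which is why the report describes its averages as representing the typical study school rather than the typical site.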

Exploratory Analyses of Variations in Impacts and Relationships 
among Outcomes 

This report also presents results from exploratory analyses that examine some hypotheses about factors 
that might account for the pattern of observed impacts presented above. These exploratory analyses are 
based on analyses of subgroups of students, schools, grade levels, and/or years of data collection. The 
information is provided as possible avenues for further exploration or for improving Reading First or 
programs like Reading First. However, the study was not designed to provide a rigorous test of these 
hypotheses, and therefore the results are only suggestive. Findings from these exploratory analyses 
include the following: 

• Data collected during three school years (2004-05, 2005-06 and 2006-07) were used to 
examine variation over time in program impacts. No consistent pattern of differential impacts 
over time was established. 

• No relationship was found between the number of years a student was exposed to RF and 
student reading achievement. 








• There was no statistically significant variation in impacts across sites in the study, either by 
grade or overall, for reading instruction or for reading comprehension. 

• Correlational analyses, which are outside the causal framework of the main impact analyses 
presented in the report, indicate a positive and statistically significant association between 
time spent on the five essential components of reading instruction promoted by the program 
and students’ reading comprehension. A one-minute increase in time devoted to instruction in 
the five dimensions per daily reading block was associated with a 0.07 point increase in 
scaled score points in first grade, and a 0.06 point increase in second grade. This relationship 
does not hold for models that include other potential mediators of student achievement. 
However, due to data limitations, these latter models could only be run on a subset of the 
data; thus, we do not know whether the differences in the findings across models are due to 
changes in the sample or changes in the model specification itself. 








Exhibit ES.3: Estimated Impacts on Reading Comprehension, Instruction, and Percentage of 
Students Engaged with Print: 2005, 2006, and 2007 (pooled) 1 

            Actual Mean     Estimated Mean                  Effect Size     Statistical
            with            without                         of Impact       Significance of
            Reading First   Reading First   Impact                          Impact (p-value)

Instruction

Number of minutes of instruction in the five components combined
  Grade 1   59.23           52.31           6.92*           0.33*           (0.005)
  Grade 2   59.08           49.30           9.79*           0.46*           (<0.001)

Percentage of intervals in five components with Highly Explicit Instruction
  Grade 1   29.39           26.10           3.29*           0.18*           (0.018)
  Grade 2   30.95           27.95           3.00*           0.16*           (0.040)

Percentage of intervals in five components with High Quality Student Practice
  Grade 1   18.44           17.61           0.82            0.05            (0.513)
  Grade 2   17.82           14.88           2.94*           0.16*           (0.019)

Reading Comprehension

Scaled Score
  Grade 1   543.8           539.1           4.7             0.10            (0.083)
  Grade 2   584.4           582.8           1.7             0.04            (0.462)
  Grade 3   609.1           608.8           0.3             0.01            (0.887)

Percent Reading At or Above Grade Level
  Grade 1   46.0            41.8            4.2             -               (0.104)
  Grade 2   38.9            37.3            1.6             -               (0.504)
  Grade 3   38.7            38.8            -0.1            -               (0.973)

Percentage of Students Engaged with Print
  Grade 1   47.84           42.52           5.33            0.18            (0.070)
  Grade 2   50.53           55.27           -4.75           -0.17           (0.104)

NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one 
non-RF school could not be included in the analysis because test score data were not available. For grade 3 in 2007, one RF 
school could not be included in the analysis because test score data were not available. 

Impact estimates are statistically adjusted (e.g., take each school’s rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

1 Except for STEP, which is pooled across 2006 and 2007 school years only. 

EXHIBIT READS: The observed mean amount of time spent per daily reading block in instruction in the five 
components combined for first grade classrooms with Reading First was 59.23 minutes. The estimated mean amount 
of time without Reading First was 52.31 minutes. The impact of Reading First on the amount of time spent in 
instruction in the five components combined was 6.92 (or 0.33 standard deviations), which was statistically significant 
(p=.005). 

SOURCES: RFIS SAT 10 administrations in the spring of 2005, 2006, and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR); RFIS Instructional 
Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007; RFIS Student Time-on-Task 
and Engagement with Print, fall 2005, spring 2006, fall 2006, and spring 2007. 
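The exhibit notes state that each "Estimated Mean without Reading First" is the actual mean minus the impact estimate. The sketch below checks that relationship against the published Exhibit ES.3 values. Exact decimal arithmetic is used so that any gap reflects the published figures rather than floating-point error; the helper names are illustrative, and reading the small gaps as rounding of the published values is an interpretation, not something the report states.

```python
from decimal import Decimal

# (label, actual mean with RF, estimated mean without RF, impact), as published in Exhibit ES.3.
ROWS = [
    ("Minutes in five components, grade 1", "59.23", "52.31", "6.92"),
    ("Minutes in five components, grade 2", "59.08", "49.30", "9.79"),
    ("Highly explicit instruction, grade 1", "29.39", "26.10", "3.29"),
    ("Highly explicit instruction, grade 2", "30.95", "27.95", "3.00"),
    ("High quality student practice, grade 1", "18.44", "17.61", "0.82"),
    ("High quality student practice, grade 2", "17.82", "14.88", "2.94"),
    ("Scaled score, grade 1", "543.8", "539.1", "4.7"),
    ("Scaled score, grade 2", "584.4", "582.8", "1.7"),
    ("Scaled score, grade 3", "609.1", "608.8", "0.3"),
    ("At or above grade level, grade 1", "46.0", "41.8", "4.2"),
    ("At or above grade level, grade 2", "38.9", "37.3", "1.6"),
    ("At or above grade level, grade 3", "38.7", "38.8", "-0.1"),
    ("Engaged with print, grade 1", "47.84", "42.52", "5.33"),
    ("Engaged with print, grade 2", "50.53", "55.27", "-4.75"),
]


def last_digit_unit(value):
    """One unit in the last published decimal place (0.01 for 49.30, 0.1 for 582.8)."""
    return Decimal(1).scaleb(value.as_tuple().exponent)


def check_rows(rows):
    """Return (label, gap, within_one_unit) for each row, where gap = actual - impact - estimated."""
    results = []
    for label, actual, estimated, impact in rows:
        actual, estimated, impact = Decimal(actual), Decimal(estimated), Decimal(impact)
        gap = (actual - impact) - estimated
        results.append((label, gap, abs(gap) <= last_digit_unit(estimated)))
    return results


results = check_rows(ROWS)
rows_with_gap = [label for label, gap, _ in results if gap != 0]
```

Every row agrees to within one unit in the last published digit; the rows that are not exact differ by a single such unit, which is what one would expect if the three columns were rounded independently.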










Exhibit ES.4: Estimated Impacts on Key Components of Scientifically Based Reading 
Instruction (SBRI): Spring 2007 

            Actual Mean     Estimated Mean                  Effect Size     Statistical
            With            Without                         of Impact       Significance of
            Reading First   Reading First   Impact                          Impact (p-value)

Professional Development (PD) in SBRI

Amount of PD in reading received by teachers (hours) a
            25.84           13.71           12.13*          0.51*           (<0.001)

Teacher receipt of PD in the five essential components of reading instruction (0-5) a
            4.30            3.75            0.55*           0.31*           (0.010)

Teacher receipt of coaching (proportion) a
            0.83            0.63            0.20*           0.41*           (<0.001)

Amount of time dedicated to serving as K-3 reading coach (percent) b, c
            91.06           57.57           33.49*          1.03*           (<0.001)

Amount of Reading Instruction

Minutes of reading instruction per day a
            105.71          87.24           18.47*          0.63*           (<0.001)

Supports for Struggling Readers

Availability of differentiated instructional materials for struggling readers (proportion) b
            0.98            0.97            0.01            0.15            (0.661)

Provision of extra classroom practice for struggling readers (0-4) a
            3.79            3.59            0.19*           0.20*           (0.018)

Use of Assessments

Use of assessments to inform classroom practice (0-3) a
            2.63            2.45            0.18            0.19            (0.090)

NOTES: 

a Classroom level outcome 
b School level outcome 

c The response rates for RF and non-RF reading coach surveys were statistically significantly different (p=0.037). Reading 
First schools were more likely to have had reading coaches and to have returned reading coach surveys. 

d Missing data rates ranged from 0.1 to 3.3 percent for teacher survey outcomes (RF: 0.1 to 1.0 percent; non-RF: 0 to 4.9 
percent) and 1.3 to 2.8 percent for reading coach and/or principal survey outcomes (RF: 0 to 1.6 percent; non-RF: 2.7 to 4.1 
percent). Survey constructs (i.e., those outcomes comprised of more than one survey item) were computed only for 
observations with complete data, with one qualification: for the construct “minutes spent on reading instruction per day,” the 
mean was calculated as the total number of minutes reported for last week (over a maximum of 5 days) divided by the 
number of days with non-missing values. Only those teacher surveys with missing data for all 5 days were missing (0.9 
percent). 

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13 
states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean amount of professional development in reading received by teachers with 
Reading First was 25.84 hours. The estimated mean amount of professional development in reading received by 
teachers without Reading First was 13.71 hours. This impact of 12.13 hours was statistically significant (p<.001). 

SOURCES: RFIS, Teacher, Reading Coach, and Principal Surveys, spring 2007 
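The notes above define two derived quantities: the estimated mean without Reading First (actual mean minus impact) and the effect size (impact divided by the standard deviation of the outcome for non-Reading First schools). The sketch below applies both definitions to the professional development row. The function names are illustrative, and the standard deviation is not reported in the exhibit; the value computed here is only what the rounded impact and effect size imply, so it is approximate.

```python
def counterfactual_mean(actual_mean, impact):
    """Estimated mean without Reading First: actual RF mean minus the impact estimate."""
    return actual_mean - impact


def effect_size(impact, comparison_sd):
    """Impact expressed in standard deviations of the non-RF outcome."""
    return impact / comparison_sd


def implied_comparison_sd(impact, published_effect_size):
    """Back out the non-RF standard deviation implied by a published impact and effect size."""
    return impact / published_effect_size


# Amount of PD in reading received by teachers (hours), Exhibit ES.4.
actual_hours = 25.84
impact_hours = 12.13
published_es = 0.51

hours_without_rf = counterfactual_mean(actual_hours, impact_hours)
sd_hours = implied_comparison_sd(impact_hours, published_es)
recovered_es = effect_size(impact_hours, sd_hours)
```

The first calculation reproduces the 13.71 hours shown in the exhibit. The implied standard deviation of roughly 24 hours should be read as a rough figure, because the effect size is published to only two decimal places.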









Exhibit ES.5: Estimated Impacts of Reading First on Decoding Skill: Grade One, Spring 2007 

                                Actual Mean     Estimated Mean                  Effect Size     Statistical
                                with            without                         of Impact       Significance of
                                Reading First   Reading First   Impact                          Impact (p-value)

Decoding Skill

Standard Score                  96.9            94.4            2.5*            0.17*           (0.025)
Corresponding Grade Equivalent  1.7             1.4
Corresponding Percentile        42              35

NOTES: 

The Test of Silent Word Reading Fluency (TOSWRF) sample includes first-graders in 248 schools from 18 sites (17 school 
districts and 1 state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools from spring 2007 TOSWRF test scores (1 st grade). 

The key metric for the TOSWRF analyses is the standard score; corresponding grade equivalents and percentiles are provided 
for reference. Although the publisher of the Test of Silent Word Reading Fluency states that straight comparisons between 
standard scores and grade equivalents will likely yield discrepancies due to the unreliability of the grade equivalents, they are 
provided because program criteria are sometimes based on grade equivalents (TOSWRF, Mather et al., 2004). 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean silent word reading fluency standard score for first-graders with Reading First 
was 96.9 standard score points. The estimated mean without Reading First was 94.4 standard score points. The impact 
of Reading First was 2.5 standard score points (or 0.17 standard deviations), which was statistically significant 
(p=.025). 

SOURCES: RFIS TOSWRF administration in spring 2007 



Summary 

The findings presented in this report are generally consistent with findings presented in the study’s 
Interim Report, which found statistically significant impacts on instructional time spent on the five 
essential components of reading instruction promoted by the program (phonemic awareness, phonics, 
vocabulary, fluency, and comprehension) in grades one and two, and which found no statistically 
significant impact on reading comprehension as measured by the SAT 10. In addition to data on the 
instructional and student achievement outcomes reported in the Interim Report, the final report also 
presents findings based upon information obtained during the study’s third year of data collection: data 
from a measure of first grade students’ decoding skill, and data from self-reported surveys of educational 
personnel in study schools. 

Analyses of the impact of Reading First on aspects of program implementation, as reported by teachers 
and reading coaches, revealed that the program had statistically significant impacts on several domains. 
The information obtained from the Test of Silent Word Reading Fluency indicates that Reading First had 
a positive and statistically significant impact on first grade students’ decoding skill. 












The final report also explored a number of hypotheses to explain the pattern of observed impacts. 
Analyses that explored the association between the length of implementation of Reading First in the study 
schools and reading comprehension scores, as well as between the number of years students had been 
exposed to Reading First instruction and reading comprehension scores, were inconclusive. No 
statistically significant variation across sites in the pattern of impacts was found. Correlational analyses 
suggest that there is a positive association between time spent on the five essential components of reading 
instruction promoted by the program and reading comprehension measured by the SAT 10, but these 
findings appear to be sensitive to model specification and the sample used to estimate the relationship. 

The study finds, on average, that after several years of funding the Reading First program, it has a 
consistent positive effect on reading instruction yet no statistically significant impact on student reading 
comprehension. Findings based on exploratory analyses do not provide consistent or systematic insight 
into the pattern of observed impacts. 








Chapter One: Overview of the Reading First Impact 
Study 

The No Child Left Behind Act of 2001 (NCLB) established the Reading First (RF) Program, a major 
federal initiative designed to help ensure that all children can read at or above grade level by the end of 
third grade. The RF legislation requires the U.S. Department of Education to contract with an outside 
entity to evaluate the impact of the Reading First Program. To meet this requirement, the Department 
contracted with Abt Associates in September 2003 to design and conduct the Reading First Impact Study 
(RFIS). Abt partnered with other organizations, including MDRC, RMC Research, Rosenblum-Brigham 
Associates, and Westat. 7 The RFIS is a multi-year study that encompasses data collection over the course 
of three school years: 2004-05, 2005-06, and 2006-07. 

This final report presents major findings based on data collected during the 2004-05, 2005-06, and 2006- 
07 school years. It reviews information about the study background, design, sample, and measures, and it 
updates information presented in the study’s interim report with data from the final year of data 
collection. 

Chapter One begins with an overview of the Reading First Program, describes the conceptual framework 
underlying the program and this evaluation as a whole, outlines the study’s guiding evaluation questions, 
summarizes the study design, measures, and data collection activities, and presents a roadmap for the 
remainder of the report. 

Reading First Program 

Reading First promotes instructional practices that have been validated by scientific research (No Child 
Left Behind Act, 2001). The legislation explicitly defines scientifically based reading research and 
outlines the specific activities state, district, and school grantees are to carry out based upon such research 
(No Child Left Behind Act, 2001). The Guidance for the Reading First Program provides further detail to 
states about the application of research-based approaches in reading (U.S. Department of Education, 
2002). Reading First funding can be used for: 

• Reading curricula and materials that focus on the five essential components of reading 
instruction as defined in the Reading First legislation: 1) phonemic awareness, 2) phonics, 3) 
vocabulary, 4) fluency, and 5) comprehension; 

• Professional development and coaching for teachers on how to use scientifically based 
reading practices and how to work with struggling readers; 

• Diagnosis and prevention of early reading difficulties through student screening, 
interventions for struggling readers, and monitoring of student progress. 



7 Other subcontractor organizations included: Computer Technology Services, Inc.; DataStar, Inc.; Field Marketing Inc.; Paladin 
Pictures, Inc.; and Westover Consultants, Inc. 



Final Report: Overview of the Reading First Impact Study 








Reading First is an ambitious federal program, yet it is also a funding stream that combines local 
flexibility and national commonalities. The commonalities are reflected in the guidelines to states and 
districts and schools about allowable uses of resources. The flexibility is reflected in two ways: one, states 
(and districts) could allocate resources to various categories within target ranges rather than on a strictly 
formulaic basis, and two, states could make local decisions about the specific choices within given 
categories (e.g., which materials, reading programs, assessments, professional development providers, 
etc.). The activities, programs, and resources that were likely to be implemented across states and districts 
would therefore reflect both national priorities and local interpretations. 

Reading First grants were made available to states between July 2002 and September 2003. By April 
2007, states had awarded subgrants to 1,809 school districts, which had provided funds to 5,880 schools. 8 
Districts and schools with the greatest demonstrated need, in terms of student reading proficiency and 
poverty status, were intended to have the highest funding priority (U.S. Department of Education, 2002). 
States could reserve up to 20 percent of their Reading First funds to support staff development, technical 
assistance to districts and schools, and planning, administration and reporting. According to the program 
guidance, this funding provided “states with the resources and opportunity. . .to improve instruction 
beyond the specific districts and schools that receive Reading First subgrants.” (U.S. Department of 
Education, 2002). Districts could reserve up to 3.5 percent of their Reading First funds for planning and 
administration (No Child Left Behind Act, 2001). For the purposes of this study, Reading First is defined 
as the receipt of Reading First funding at the school level. 

A key part of the evaluation is to determine the impact of Reading First on instruction in the targeted 
grades. Therefore, classroom observations of instructional practices in reading were needed from both RF 
and non-RF classrooms. Because the Reading First legislation calls for reading instruction to be based on 
scientifically based reading research findings, the RFIS observational instrument built upon findings 
describing evidence-based instructional practices such as those in the National Research Council’s report 
(Snow, Burns, and Griffin, 1998) and the National Reading Panel report (National Institute of Child 
Health and Human Development, 2000). The Reading First legislation highlights five essential 
components of reading instruction. These five components, or dimensions, of reading instruction formed 
the basis for the development of the RFIS observation instrument. 9 Each dimension is described below. 

Phonemic Awareness 

Phonemic awareness instruction teaches students to distinguish and manipulate the sounds in words. 10 A 
phoneme is the smallest unit of sound that affects the meaning of a spoken word. Before learning to read 
print, children must first understand that words are made up of component sounds. For example, changing 
the first phoneme in the word hat from /h/ to /p/ changes the word from hat to pat. Phonemic awareness 
instruction improves children’s word reading and helps children learn to spell (e.g., Ball and Blachman, 
1991; Bus and van IJzendoorn, 1999; see also NICHD, 2000). 



8 Data were obtained from the SEDL website (www.sedl.org/readingfirst). 

9 For ease of explication, the measures created from IPRI data are referred to as the five dimensions of reading instruction (or 
“the five dimensions”) throughout the report. References to the programmatic emphases as required by legislation are labeled 
as the five essential components of reading instruction. 

10 Phonemic awareness is a subcategory of phonological awareness. Phonological awareness includes phonemic awareness, but 
also refers to the ability to recognize and work with larger parts of spoken language, such as syllables and onsets and rimes. 








Phonics 

Phonics instruction helps children learn and understand the relationships between the letters of written 
language and the sounds (phonemes) of spoken language. Instruction in phonics helps children understand 
that there are predictable relationships between letters and sounds, helps them recognize familiar words, 
and allows children to “decode” unfamiliar printed words (see NICHD, 2000). 

Fluency Building 

Fluency is the ability to read text accurately and smoothly. The more automatically students can read 
individual words, the more they can focus on understanding the meaning of whole sentences and passages 
(NICHD, 2000). Fluency instruction helps students who are learning to read by building a bridge between 
recognizing words more efficiently and comprehending the meaning of text (e.g., Reutzel and 
Hollingsworth, 1993; also see NICHD, 2000). 

Vocabulary Development 

Oral vocabulary refers to words used in speaking or recognized in listening. Reading vocabulary refers to 
words that are recognized or used in print. Instruction for beginning readers uses oral vocabulary to help 
them make sense of the words they see, and instruction that develops their reading vocabulary allows 
them to progress to more complex texts (e.g., Beck, Perfetti and McKeown, 1982; McKeown et al., 1983; 
also see NICHD, 2000). Readers must know what words mean before they can understand what they are 
reading. 

Comprehension of Connected Text 

Comprehension is understanding what is being or has been read. Students will not understand text if they 
can read individual words, but do not understand what sentences, paragraphs, and longer passages mean. 
Proficient readers elicit meaning from — or comprehend — text, rather than simply identifying a series of 
words. Instruction in comprehension strategies provides specific tools for readers to use to make sense of 
the text they read (see NICHD, 2000). Comprehension strategies are vital to the development of 
competent readers because they aid in understanding the collective significance of words, sentences, and 
passages. 

Conceptual Model 

Exhibit 1.1 identifies the program’s central goals and specifies the pathways through which the principles 
and components of the Reading First program are hypothesized to improve reading instruction, and 
subsequently student reading achievement. This conceptual framework provides a substantive backdrop 
for the Reading First Impact Study. The Reading First Impact Study has focused primarily on Column 3 
(which specifies aspects of program implementation, including necessary components of scientifically 
based reading instruction hypothesized to achieve its longer term student achievement goals) and Column 
4 (which details aspects of student reading achievement). The hypothesis underlying Reading First is that 
these outcomes will only be achieved through successful implementation of appropriate research-based 
reading programs, teacher professional development, use of diagnostic assessments, and appropriate 
classroom organization and provision of supplemental services. 






Exhibit 1.1: Conceptual Framework for the Reading First Program: From Legislation and Funding to Program Implementation and Impact 



Columns, left to right: Legislative Specifications and Administrative Guidelines → Flow of Funds to Eligible Schools → Design and Implementation of Research-Based Reading Programs → Enhanced Student Reading Achievement 

Legislative Specifications and Administrative Guidelines 

• NCLB, Title I, Part B, Subpart 1 

• Specification of effective reading program components 

• Rules for state grant and district subgrant formulas and allocation 

• Specification for allowable state and district use of funds 

• Administrative guidelines for state grant application and district subgrant 

• Accountability and evaluation requirements 

Flow of Funds to Eligible Schools (steps in sequence) 

• SEAs submit applications for Reading First funds 

• Awarded funds to SEAs with approved application 

• Eligible LEAs and/or schools submit competitive subgrant proposal 

• SEAs and/or schools award subgrants to LEAs and/or schools with approved applications 

• Funds distributed to eligible schools 

Enhanced Student Reading Achievement 

• Increased proportion of students reading at/above grade level 

• Adequate mastery of five essential components of early reading 

• All students reading at grade level by the end of third grade 

Context underlying all columns: State and district policy context; existing reading programs; other resources and programs that may support reading 
















Research Questions and Design 



The Reading First Impact Study was commissioned to address the following questions: 

1) What is the impact of Reading First on student reading achievement? 

2) What is the impact of Reading First on classroom instruction? 

3) What is the relationship between the degree of implementation of scientifically based reading 
instruction and student reading achievement? 

The Reading First Impact Study uses a regression discontinuity design (RDD) that capitalizes on the 
systematic processes some school districts used to allocate Reading First funds once their states had 
received RF grants. 11 A regression discontinuity design is the strongest quasi-experimental method 
available for estimating program impacts. Under certain conditions, all of which are met by the present 
study, this method can produce unbiased estimates of those impacts. Within each 
district or site: 

1) Schools eligible for Reading First grants were rank-ordered for funding based on a 
quantitative rating, such as an indicator of past student reading performance or poverty; 12 

2) A cut-point in the rank-ordered priority list separated schools that did or did not receive 
Reading First grants, and this cut-point was set without knowing which schools would then 
receive funding; and 

3) Funding decisions were based only on whether a school’s rating was above or below its local 
cut-point; nothing superseded these decisions. 
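
The three conditions above describe what is often called a sharp assignment rule: funding is fully determined by where a school's rating falls relative to the local cut-point. The following Python sketch shows one way such a rule could be checked for a single site. The `School` record, the ratings, the cut-point value, and the convention that ratings at or above the cut-point are funded are all hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    rating: float   # quantitative rating used to rank schools (hypothetical scale)
    funded: bool    # whether the school received a Reading First grant


def follows_cut_point(schools: list[School], cut_point: float) -> bool:
    """True if funding is determined solely by rating relative to the cut-point.

    Assumption: schools rated at or above the cut-point are funded. A site that
    gives priority to low ratings would flip the comparison.
    """
    return all(s.funded == (s.rating >= cut_point) for s in schools)


site = [School("A", 82.0, True), School("B", 75.5, True),
        School("C", 69.0, False), School("D", 51.2, False)]

# Same site plus one school funded despite a rating below the cut-point.
site_with_override = site + [School("E", 40.0, True)]

print(follows_cut_point(site, cut_point=70.0))                # True
print(follows_cut_point(site_with_override, cut_point=70.0))  # False
```

A site in which any funding decision overrode the ranking, as in the second call, would not satisfy the third condition.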

Also, assuming that the shape of the relationship between schools’ ratings and outcomes is correctly 
modeled, once the above conditions have been met, there should be no systematic differences between 
eligible schools that did and did not receive Reading First grants (Reading First and non-Reading First 
schools respectively), except for the characteristics associated with the school ratings used to determine 
funding decisions. Controlling for differences in schools’ ratings allows one to control statistically for all 
systematic pre-existing differences between the two groups. One then can estimate the impact of Reading 
First by comparing the outcomes for Reading First schools and non-Reading First schools in the study 
sample, controlling for differences in their ratings. Non-Reading First schools in a regression 
discontinuity analysis thereby play the same role as do control schools in a randomized experiment — it is 
their regression-adjusted outcomes that represent the best indications of what outcomes would have been 
for the treatment group (in this instance, Reading First schools) in the absence of the program being 
evaluated. 13 
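
The logic of the preceding paragraph can be illustrated with simulated data. The sketch below is not the study's estimation model (see Appendix B for that). It assumes a single site, a linear relationship between rating and outcome, and made-up parameter values, and it uses only the Python standard library. It compares a naive difference in mean outcomes with a regression-adjusted estimate that controls for the rating.

```python
import random

CUT_POINT = 0.0      # hypothetical cut-point on a rating scale centered at zero
TRUE_IMPACT = 2.0    # impact built into the simulated data


def ols(X, y):
    """Ordinary least squares: solve (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate_site(n_schools, seed=0):
    """Simulated schools: outcome depends linearly on rating plus the impact."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_schools):
        rating = rng.uniform(-1.0, 1.0)
        funded = 1.0 if rating >= CUT_POINT else 0.0
        outcome = 50.0 + 5.0 * rating + TRUE_IMPACT * funded + rng.gauss(0.0, 1.0)
        rows.append((rating, funded, outcome))
    return rows


rows = simulate_site(n_schools=2000)

# Naive comparison: ignores that funded schools also have higher ratings.
funded_y = [o for _, t, o in rows if t == 1.0]
unfunded_y = [o for _, t, o in rows if t == 0.0]
naive_difference = sum(funded_y) / len(funded_y) - sum(unfunded_y) / len(unfunded_y)

# Regression-adjusted comparison: intercept, funding indicator, centered rating.
X = [[1.0, t, r - CUT_POINT] for r, t, _ in rows]
y = [o for _, _, o in rows]
intercept, impact_hat, rating_slope = ols(X, y)

print(f"Naive difference in means: {naive_difference:.2f}")
print(f"Regression-adjusted impact estimate: {impact_hat:.2f} (true value {TRUE_IMPACT})")
```

In this simulation the naive difference overstates the impact because it also absorbs the rating-related difference between the two groups. The adjusted estimate recovers a value close to the one built into the data, but only because the linear functional form assumed in the regression matches the way the data were generated.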



11 Appendix A indicates when study sites first received their Reading First grants. 

12 Each study site could (and did) use different metrics to rate or rank schools; it is not necessary for all study sites to use the 
same metric. 

13 See Appendix B of this report and Gamse, Bloom, Kemple & Jacob (2008) for a more extended discussion of the regression 
discontinuity design, the study sample, and the study’s approach to estimating impacts. 








Study Sample 



The study sample was selected purposively to meet the requirements of the regression discontinuity 
design by selecting a sample of sites that had used a systematic rating or ranking process to select their 
Reading First school grantees. Within these sites, the selection of schools focused on schools as close to 
the site-specific cut-points as possible in order to obtain schools that were as comparable as possible in 
the treatment and comparison groups. 

The study sample includes 18 study sites: 17 school districts and one state-wide program. Sixteen districts 
and one state-wide program were selected from among 28 districts and one state-wide program that had 
demonstrably met the three criteria listed above. One other school district agreed to randomly assign some 
of its eligible schools to Reading First or a control group. The final selection reflected wide variation in 
district characteristics and provided enough schools to meet the study’s sample size requirements. The 
regression discontinuity sites provide 238 schools for the analysis, and the randomized experimental site 
provides 10 schools. Half the schools at each site are Reading First schools and half are non-Reading First 
schools. In three sites, the study sample includes all the RF schools in that site; in the remaining 15 sites, 
the study sample includes some, but not all, of the RF schools in that site. 

At the same time, the study deliberately endeavored to obtain a sample that was geographically diverse 
and as similar as possible to the population of all RF schools. The final study sample of 248 schools, 125 
of which are Reading First schools, represents 44 percent of the Reading First schools in their respective 
sites (at the time the study selected its sample in 2004). The study’s sample of RF schools is large, is quite 
similar to the population of all RF schools, is geographically diverse, and represents states (and districts) 
that received their RF grants across the range of RF state award dates. The average Year 1 grant for RF 
schools in the study sample ranged from about $81,790 to $708,240, with a mean of $188,782. This 
translates to an average of $601 per RF student. Nationally, the median RF grant (based on data reported 
in the 2004-05 school year) is $138,000 (U.S. Department of Education, 2006). For more detailed 
information about the selection process and the study sample, see the study’s Interim Report (Gamse, 
Bloom, Kemple & Jacob, 2008). 
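
The sample counts reported above can be cross-checked with simple arithmetic. The Python sketch below uses only figures stated in this section. The implied number of students per school is a rough back-calculation, not a reported statistic, and the variable names are illustrative.

```python
# Counts reported in the text.
rdd_schools = 238            # schools from the regression discontinuity sites
experimental_schools = 10    # schools from the randomized experimental site
rf_schools = 125             # Reading First schools in the study sample

total_schools = rdd_schools + experimental_schools
non_rf_schools = total_schools - rf_schools
rf_share = rf_schools / total_schools

# Grant figures reported in the text (Year 1, RF schools in the study sample).
mean_grant = 188_782
grant_per_student = 601

# Rough back-calculation only: the per-student figure in the report may be an
# average of school-level ratios rather than the ratio of the two means.
implied_students_per_school = mean_grant / grant_per_student

print(f"Total schools: {total_schools}")
print(f"Non-RF schools: {non_rf_schools}")
print(f"RF share of sample: {rf_share:.1%}")
print(f"Implied RF students per school: about {implied_students_per_school:.0f}")
```

The roughly even split between Reading First and non-Reading First schools is consistent with the statement that about half the schools at each site are Reading First schools.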

Data Collection and Outcome Measures 

Exhibit 1.2 summarizes the study’s three-year, multi-source data collection plan. The present report is 
based on data for school years 2004-05, 2005-06, and 2006-07. Data collection included student 
assessments in reading comprehension and decoding, and classroom observations of teachers’ 
instructional practices in reading, teachers’ instructional organization and order, and students’ 
engagement with print. Data were also collected through surveys of teachers, reading coaches, and 
principals, and interviews of district personnel. Sample sizes and response rates for all data collection 
activities are presented in Exhibit 1.3; see Appendix C for detailed descriptions of the numbers of schools, 
classrooms, survey respondents, and students included in each separate data collection activity. See 
Appendix B, Part 5 for a discussion of how missing data were handled. 








Exhibit 1.2: Data Collection Schedule for the Reading First Impact Study 



                                                          2004-2005       2005-2006       2006-2007
Data Collection Elements                                  Fall    Spring  Fall    Spring  Fall    Spring

Student Testing
  Stanford Achievement Test, 10th Edition (SAT 10)        ✓       ✓               ✓               ✓
  Test of Silent Word Reading Fluency (TOSWRF)                                                    ✓

Classroom Observations
  Instructional Practice in Reading Inventory                     ✓       ✓       ✓       ✓       ✓
  Student Time-on-Task and Engagement with Print (STEP)                   ✓       ✓       ✓       ✓
  Global Appraisal of Teaching Strategies (GATS)                          ✓       ✓       ✓       ✓

Teacher, Principal, Reading Coach Surveys                         ✓                               ✓

District Staff Interviews                                         ✓                               ✓

















Exhibit 1.3: Summary of RFIS Data Collection Activities and Respective Response Rates, By Grade 







                          Fall 2004                   Spring 2005                 Fall 2005
                          RF            Non-RF        RF            Non-RF        RF            Non-RF
                          N      (%)    N      (%)    N      (%)    N      (%)    N      (%)    N      (%)

Student assessments (SAT 10) a
  Grade 1                 5,417  72%    5,139  69%    7,791  84%    7,037  80%
  Grade 2                 5,178  71%    4,978  70%    7,519  85%    7,046  82%
  Grade 3                 5,281  73%    4,861  69%    7,362  84%    7,014  84%

Student assessments (TOSWRF) b
  Grade 1

Classroom observations (IPRI)
  Grade 1                                             809    97%    820    96%    720    98%    704    98%
  Grade 2                                             766    96%    760    95%    664    97%    668    98%

Student engagement with print observations (STEP) c
  Grade 1                                                                         359    99%    349    99%
  Grade 2                                                                         324    97%    329    98%

Global Appraisal of Teaching Strategies (GATS) d
  Grade 1                                                                         359    99%    351    99%
  Grade 2                                                                         333    99%    335    99%

Surveys
  Grade 1 Teacher                                     396    73%    363    67%
  Grade 2 Teacher                                     362    73%    319    65%
  Grade 3 Teacher                                     318    71%    279    64%
  Reading Coach                                       118    95%    79     72%
  Principal                                           98     78%    89     72%

Site/District Interviews                              18     100%   18     100%











Notes: 

a In 12 sites, the SAT 10 classroom sample mirrors the observation (and TOSWRF) classroom samples; in the remaining 6 sites, state and district testing requirements meant that all classrooms were tested. 
b The TOSWRF classroom sample mirrors the classrooms selected for classroom observations. 

c In each round of two classroom observations, the STEP was administered once while the IPRI was administered twice. 

d At the conclusion of each IPRI observation (two per classroom), the observer completed a GATS form for the classroom. Information presented here on the GATS was combined to produce a single record 
per classroom. 

Blank cells indicate no data collection for that component at that time period. Response rates shown are for the analytic sample of 248 schools. 

Active consent (i.e., only students whose parents had signed and returned consent forms) was used in fall 2004. Passive consent (i.e., all eligible students were tested unless their parents submitted forms 
refusing to allow their children to be tested) was used in subsequent test administrations. 

Reading instruction in each classroom was observed on two consecutive days in each wave of data collection. Observations of student engagement were scheduled for the same classrooms as observations of 
teachers’ reading instruction. (See Appendix C for a complete discussion of the observation protocols). 

The numbers reported here for SAT 10 student assessments differ from those in Exhibit 3.2 in the Interim Report because the Interim Report incorrectly presented the numbers of students eligible to be tested 
rather than the number of students tested. Note that the response rates (the number of students tested divided by the number of students eligible to be tested) were correct in Exhibit 3.2 in the Interim Report, 
and are reproduced here. 

EXHIBIT READS: During fall 2004, there were 5,417 student assessments completed in Reading First grade 1 classrooms, corresponding to 72 percent of all eligible student assessments. 
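
The response rate defined in the notes above (students tested divided by students eligible to be tested) can be expressed directly. The sketch below also shows how a reported count and a rounded rate bound the number of eligible students. The helper names are illustrative, and the bounds assume that the reported percentages were rounded conventionally to whole numbers.

```python
import math


def response_rate(tested: int, eligible: int) -> float:
    """Response rate as defined in the exhibit notes: tested / eligible."""
    return tested / eligible


def implied_eligible_range(tested: int, rate_percent: int) -> tuple[int, int]:
    """Range of eligible counts consistent with a rate rounded to a whole percent.

    Assumption: the reported percentage was rounded conventionally, so the
    unrounded rate lies within half a percentage point of the reported value.
    """
    low = tested / ((rate_percent + 0.5) / 100)
    high = tested / ((rate_percent - 0.5) / 100)
    return math.ceil(low), math.floor(high)


# Fall 2004, Reading First grade 1 classrooms: 5,417 students tested, 72 percent.
low, high = implied_eligible_range(tested=5417, rate_percent=72)
print(f"Eligible students implied by the reported figures: between {low} and {high}")
print(f"Rate at the lower bound: {response_rate(5417, low):.1%}")
print(f"Rate at the upper bound: {response_rate(5417, high):.1%}")
```

The exact number of eligible students is not reported in this exhibit, so only a range can be recovered from the rounded percentage.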










Exhibit 1.3: Summary of RFIS Data Collection Activities and Respective Response Rates, By Grade (continued) 






                          Spring 2006                 Fall 2006                   Spring 2007
                          RF            Non-RF        RF            Non-RF        RF            Non-RF
                          N      (%)    N      (%)    N      (%)    N      (%)    N      (%)    N      (%)

Student assessments (SAT 10) a
  Grade 1                 6,522  86%    5,588  85%                                6,954  88%    5,534  85%
  Grade 2                 6,497  86%    5,596  85%                                6,777  90%    5,621  85%
  Grade 3                 6,254  87%    6,043  87%                                6,172  86%    6,117  86%

Student assessments (TOSWRF) b
  Grade 1                                                                         5,520  87%    5,272  85%

Classroom observations (IPRI)
  Grade 1                 718    99%    707    99%    738    100%   703    100%   734    99%    708    99%
  Grade 2                 666    100%   668    100%   684    99%    672    100%   684    99%    676    100%

Student engagement with print observations (STEP) c
  Grade 1                 351    97%    347    98%    366    99%    343    97%    361    98%    349    97%
  Grade 2                 326    97%    330    99%    339    98%    332    99%    341    99%    333    98%

Global Appraisal of Teaching Strategies (GATS) d
  Grade 1                 358    99%    354    99%    369    99%    352    100%   367    99%    354    99%
  Grade 2                 334    99%    334    100%   342    99%    336    99%    342    99%    338    99%

Surveys
  Grade 1 Teacher                                                                 328    87%    317    88%
  Grade 2 Teacher                                                                 313    89%    304    87%
  Grade 3 Teacher                                                                 286    84%    244    74%
  Reading Coach                                                                   123    99%    105    89%
  Principal                                                                       104    83%    99     80%

Site/District Interviews                                                          18     100%   18     100%




EXHIBIT READS: During spring 2006, there were 6,522 student assessments completed in Reading First grade 1 classrooms, corresponding to 86 percent of all eligible student assessments. 










Exhibit 1.4 lists the principal domains for the study, the outcome measures within each domain, and the 
data sources for each measure. 14 These include: 

Student reading performance, assessed with the reading comprehension subtest of the Stanford 
Achievement Test, 10th Edition (SAT 10, Harcourt Assessment, Inc., 2004). The SAT 10 was 
administered to students in grades one, two and three during fall 2004, spring 2005, spring 2006, and 
spring 2007, with an average completion rate of 83 percent across all administrations. In the spring of 
2007 only, first grade students were assessed with the Test of Silent Word Reading Fluency (TOSWRF, 
Mather et al., 2004), a measure designed to assess students’ ability to decode words from among strings 
of letters. The average completion rate was 86 percent. Three outcome measures of student reading 
performance were created from SAT 10 and TOSWRF data. 

Individualized student testing on all five essential components of reading skill emphasized by Reading 
First was not conducted due to concerns about cost as well as about the burden of study data collection on 
schools and students. The study team selected reading comprehension as the central reading achievement 
construct for the study, recognizing that the other four essential components would not be assessed. The 
selection of reading comprehension reflected its importance as the “essence of reading” that sets the stage 
for children’s later academic success (National Institute of Child Health and Human Development, 2000). 
The SAT 10 reading comprehension subtest chosen is feasible in group-administered settings and on a 
large scale, and this test was already being used by some study sites, which reduced the burden on schools 
and students. 

Midway through the evaluation, the study team, in conjunction with IES, decided to add a test of skills 
that precede comprehension. The study added a decoding test to assess whether the Reading First program 
had an effect on this skill. Resources were insufficient to expand the data collection into all grades. 
Because the programmatic emphasis on decoding skill was hypothesized to be more intensive in first 
grade, the study added the Test of Silent Word Reading Fluency only in first grade. 

Classroom reading instruction, assessed in first-grade and second-grade reading classes through an 
observation system developed by the study team called the Instructional Practice in Reading Inventory 
(IPRI). Observations were conducted during scheduled reading blocks in each sampled classroom on two 
consecutive days during each wave of data collection: spring 2005, fall 2005 and spring 2006, and fall 
2006 and spring 2007. The average completion rate was 98 percent across all years. The IPRI can be used 
for observations of varying lengths, reflecting the fact that schools’ defined reading blocks can vary; most 
reading blocks are 90 minutes or more. Observers used a booklet containing a series of individual IPRI 
forms, each of which corresponds to a three-minute interval of observation. The average reading block 
based on observational data was 108 minutes. Eight outcome measures of classroom instruction were 
created from IPRI data to represent the components of reading instruction emphasized by the Reading 
First legislation. 15 
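
Because each IPRI form corresponds to a three-minute interval, the number of forms an observer completes scales with the length of the reading block. The Python sketch below illustrates that relationship. The rule that a partial final interval still uses a full form is an assumption made for the example, not a documented study protocol.

```python
import math

INTERVAL_MINUTES = 3  # each IPRI form covers a three-minute interval of observation


def forms_needed(block_minutes: float) -> int:
    """Number of three-minute IPRI forms needed to cover a reading block.

    Assumption: a partial final interval still uses one form.
    """
    return math.ceil(block_minutes / INTERVAL_MINUTES)


# A 90-minute block and the 108-minute average block reported in the text.
for minutes in (90, 108):
    print(f"{minutes}-minute reading block: {forms_needed(minutes)} forms")
```

Under this assumption, the average observed reading block of 108 minutes corresponds to 36 three-minute intervals.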



14 Appendix C presents more detailed information, including (where applicable) copies of measures developed specifically for 
the RFIS. 

15 For ease of explication, the measures created from IPRI data are referred to as the five dimensions of reading instruction (or 
“the five dimensions”) throughout the report. References to the programmatic emphases as required by legislation are labeled 
as the five essential components of reading instruction. 








Exhibit 1.4: Description of Domains, Outcome Measures, and Data Sources Utilized in the 
Reading First Impact Study 



Each entry below gives the outcome measure and its description, followed by the data source, grouped by domain. 

Domain: Student reading performance 

Outcome measure: Mean scaled scores for 1st, 2nd, and 3rd grade students, represented as a 
continuous measure of student reading comprehension. Because scaled scores are continuous across 
grade levels, values for all three grade levels can be shown on a single set of axes. 
Source: Stanford Achievement Test, 10th Edition (SAT 10) 

Outcome measure: Percentage of 1st, 2nd, and 3rd grade students at or above grade level, based 
upon established test norms that correspond to grade level performance, by grade and month. The on 
or above grade level performance percentages were based on the start of the school year, date of the 
test and the scaled score, as well as the related grade equivalent. 
Source: Stanford Achievement Test, 10th Edition (SAT 10) 

Outcome measure: Mean standard scores for 1st grade students, represented as a continuous measure 
of first grade students’ decoding skill. 
Source: Test of Silent Word Reading Fluency 

Domain: Classroom reading instruction 

Outcome measure: Minutes of instruction in phonemic awareness, or how much instructional time 1st 
and 2nd grade teachers spent on phonemic awareness. 
Source: RFIS Instructional Practice in Reading Inventory 

Outcome measure: Minutes of instruction in phonics, or how much instructional time 1st and 2nd 
grade teachers spent on phonics. 
Source: RFIS IPRI 

Outcome measure: Minutes of instruction in fluency building, or how much instructional time 1st and 
2nd grade teachers spent on fluency building. 
Source: RFIS IPRI 

Outcome measure: Minutes of instruction in vocabulary development, or how much instructional time 
1st and 2nd grade teachers spent on vocabulary development. 
Source: RFIS IPRI 

Outcome measure: Minutes of instruction in comprehension, or how much instructional time 1st and 
2nd grade teachers spent on comprehension of connected text. 
Source: RFIS IPRI 

Outcome measure: Minutes of instruction in all five dimensions combined, or how much instructional 
time 1st and 2nd grade teachers spent on all five dimensions combined. 
Source: RFIS IPRI 

Outcome measure: Proportion of each observation with highly explicit instruction, or the proportion 
of time spent within the five dimensions when teachers used highly explicit instruction (e.g., 
instruction included teacher modeling, clear explanations, and the use of examples). 
Source: RFIS IPRI 




Proportion of each observation with high quality student practice, or 

the proportion of time spent within the five dimensions when teachers 
provided students with high quality student practice opportunities (e.g., 
teachers asked students to practice such word learning strategies as 
context, word structure, and meanings). 


RFIS IPRI 


Student 
engagement 
with print 


Percentage of 1 st and 2 nd grade students engaged with print, 

represented as the per-classroom average of the percentage of students 
engaged with print across three sweeps in each classroom during observed 
reading instruction. 


RFIS Student 
Time-on-Task 
and Engagement 
with Print 
(STEP) 



Domain: Professional development in scientifically based reading instruction

  Outcome measure: Amount of PD in reading received by teachers, or teachers’ self-reported number of
  hours of professional development in reading during 2006-07.
  Source: RFIS Teacher Survey

  Outcome measure: Teacher receipt of PD in the five essential components of reading instruction, or
  the number of essential components teachers reported were covered in professional development they
  received during 2006-07.
  Source: RFIS Teacher Survey

  Outcome measure: Teacher receipt of coaching, or whether or not a teacher reported receiving
  coaching or mentoring from a reading coach in reading programs, materials, or strategies in 2006-07.
  Source: RFIS Teacher Survey

  Outcome measure: Amount of time dedicated to serving as K-3 reading coach, or reading coaches’
  self-reported percentage of time spent as the K-3 reading coach for their school in 2006-07.
  Source: RFIS Reading Coach Survey

Domain: Amount of reading instruction

  Outcome measure: Minutes of reading instruction per day, or teachers’ reported average amount of
  time devoted to reading instruction per day over the prior week.
  Source: RFIS Teacher Survey

Domain: Supports for struggling readers

  Outcome measure: Availability of differentiated instructional materials for struggling readers, or
  whether or not schools reported that specialized instructional materials beyond the core reading
  program were available for struggling readers.
  Source: RFIS Reading Coach and Principal Surveys

  Outcome measure: Provision of extra classroom practice for struggling readers, or the number of
  dimensions in which teachers reported providing extra practice opportunities for struggling students
  in the past month.
  Source: RFIS Teacher Survey

Domain: Use of assessments

  Outcome measure: Use of assessments to inform classroom practice, or the number of instructional
  purposes for which teachers reported using assessment results.
  Source: RFIS Teacher Survey



To create the six analytic variables about time spent in the dimensions of reading instruction, data from 
classroom observations of instruction were transformed from intervals into minutes. In cases where only 
one instructional behavior/activity was observed, that interval was designated accordingly. In cases where 
multiple instructional behaviors were observed during one three-minute interval, the minutes were 
distributed across the specific instructional behaviors that had been observed. (See Appendix C for a more 
detailed discussion of the transformation of intervals into minutes.) To create the last two analytic 
variables, the data from classroom observations were summed across all the individual three-minute 
intervals within an observation. The total number of intervals (within each observation) with highly 
explicit instruction and high quality student practice was then divided by the total number of intervals 
(within each observation) with instruction in the five dimensions of reading. 
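
The two transformations described above can be illustrated with a minimal sketch. It is not the study's procedure: the equal split of each three-minute interval across the dimensions coded in it is an assumption made here for illustration (the actual allocation rule is documented in Appendix C), and the function names and sample data are hypothetical.

```python
from collections import defaultdict

INTERVAL_MINUTES = 3.0  # each IPRI form covers a three-minute interval


def minutes_by_dimension(intervals):
    """Convert coded intervals into minutes per dimension.

    `intervals` is a list of sets; each set holds the dimensions observed
    in one three-minute interval.
    ASSUMPTION: minutes are split equally among the dimensions coded in an
    interval. The study's real rule is described in its Appendix C.
    """
    minutes = defaultdict(float)
    for observed in intervals:
        if not observed:
            continue  # no instruction in the five dimensions this interval
        share = INTERVAL_MINUTES / len(observed)
        for dimension in observed:
            minutes[dimension] += share
    return dict(minutes)


def proportion_flagged(intervals, flags):
    """Share of intervals with five-dimension instruction that also carry
    a flag (for example, highly explicit instruction)."""
    eligible = [flag for observed, flag in zip(intervals, flags) if observed]
    return sum(eligible) / len(eligible) if eligible else 0.0


# Hypothetical four-interval observation (12 minutes).
observation = [{"phonics"}, {"phonics", "comprehension"}, set(), {"comprehension"}]
explicit = [True, False, False, True]

minutes = minutes_by_dimension(observation)
share_explicit = proportion_flagged(observation, explicit)
```

In this made-up observation, phonics and comprehension each receive 4.5 minutes, and two of the three intervals with five-dimension instruction are flagged as highly explicit.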

Student engagement with print. Beginning in fall 2005, the study conducted classroom observations 
using the Student Time-on-Task and Engagement with Print (STEP) instrument to measure the percentage 
of students engaged in academic work who are reading or writing print. The STEP was used to observe
classrooms in fall 2005, spring 2006, fall 2006, and spring 2007, with an average completion rate of 98
percent across all years. The STEP observer records a time-sampled “snapshot” of student engagement
three times in each classroom, i.e., three “sweeps” during the designated reading block in each classroom.
Six minutes after entering the classroom during ongoing reading instruction, the STEP observer begins 
collecting the first of these sweeps. During each sweep, which lasts for approximately three minutes, the 
observer classifies every student in the classroom as either on- or off-task, and, if on-task, whether the
student is: 1) reading connected text (a story or passage); 2) reading isolated text (letters, words, or
isolated sentences); and/or 3) writing. The STEP observer waits until six minutes have elapsed between 
the end of one sweep and the start of the next. After the third and final sweep, the STEP observer leaves 
the classroom. The STEP observer typically completes STEP observations in three classrooms spending 
about 25-30 minutes in each classroom. Data collected with the STEP measure are used to create one 
outcome representing the average percentage of students engaged with print during the designated reading 
block. 
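
As a rough illustration of how the single STEP outcome is formed, the sketch below averages sweep-level percentages within one classroom. The data structure (one boolean per student per sweep) and the function names are assumptions for illustration, not the study's file layout.

```python
def percent_engaged(sweep):
    """Percentage of students in one sweep coded as engaged with print.

    `sweep` is a list of booleans, one per student (hypothetical layout).
    """
    return 100.0 * sum(sweep) / len(sweep)


def classroom_engagement(sweeps):
    """Per-classroom outcome: the mean of the sweep-level percentages."""
    percentages = [percent_engaged(sweep) for sweep in sweeps]
    return sum(percentages) / len(percentages)


# Hypothetical classroom of 20 students observed in three sweeps.
sweeps = [
    [True] * 10 + [False] * 10,  # 50 percent engaged with print
    [True] * 8 + [False] * 12,   # 40 percent
    [True] * 12 + [False] * 8,   # 60 percent
]
engagement = classroom_engagement(sweeps)
```

Averaging the three sweep percentages gives each sweep equal weight even if the number of students present changes between sweeps.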

Professional development in scientifically based reading instruction, amount of reading instruction, 
supports for struggling readers, and use of assessments. Within these four domains, eight outcome 
measures were created based on data from surveys of principals, reading coaches, and teachers about 
school and classroom resources. The eight outcome measures represent aspects of scientifically based 
reading instruction promoted by the Reading First legislation and guidance. Surveys were fielded in 
spring 2005 and again in spring 2007 with an average completion rate across all respondents of 73 percent 
in spring 2005 and 86 percent in spring 2007. This final report includes findings from 2007 surveys only. 

Additional data were collected by the study team in order to create measures used in correlational 
analyses. These data include: 

The Global Appraisal of Teaching Strategies (GATS), a 12-item checklist designed to measure teachers’ 
instructional strategies related to overall instructional organization and order, is adapted from “The 
Checklist of Teacher Competencies” (Foorman and Schatschneider, 2003). Unlike the IPRI, which
focuses on discrete teacher behaviors, the GATS was designed to capture global classroom management 
and environmental factors. Items covered topics such as the teacher’s organization of materials, lesson 
delivery, responsiveness to students, and behavior management. The GATS was completed by the 
classroom observer immediately after each IPRI observation, meaning that each sampled classroom was 
rated on the GATS twice in the fall and twice in the spring in both the 2005-2006 school year and the 
2006-2007 school year. The GATS was fielded in fall 2005, spring 2006, fall 2006, and spring 2007, with 
an average completion rate of over 99 percent. A single measure from the GATS data was created for use 
in correlational analyses. 

Study’s Methodological Approach 

This section summarizes key features of the study’s methodological approach, including use of
multi-level models, determination of statistical significance, and multiple hypothesis testing. More detailed
information about the study’s approach is presented in Appendix B. 

Approach to Estimating Impacts 

As described in detail in Appendix B, and in the study’s Interim Report, all impact estimates are
regression-adjusted to control for (1) a linear specification of each site’s specific rating variable for
selecting Reading First schools, and (2) selected student background characteristics used in the analysis
(Gamse, Bloom, Kemple, & Jacob, 2008). 16 The impacts have been estimated using multi-level models to
account for the clustering of students within classrooms, classrooms within schools, and schools within
sites. Throughout this report, tables that display impact estimates present values in the “Actual Mean with
Reading First” column that are actual, unadjusted values for Reading First schools. The values in the
“Estimated Mean without Reading First” column represent the best estimates of what would have
happened in RF schools absent RF funding, and these are calculated by subtracting the impact estimates
from the RF schools’ actual mean values.

16 See Appendix B for a description of the background characteristics used in the estimation of impacts.
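
The column definition above amounts to a single subtraction. The short sketch below applies it to the grade one row of Exhibit 2.1 (minutes in the five dimensions combined); the function name is illustrative only.

```python
def estimated_mean_without_rf(actual_mean_with_rf, impact):
    """Estimated counterfactual mean: actual RF mean minus the impact estimate."""
    return actual_mean_with_rf - impact


# Grade one, minutes of instruction in the five dimensions (Exhibit 2.1).
grade1_counterfactual = round(estimated_mean_without_rf(59.23, 6.92), 2)
```

Because the published figures are rounded, recomputing a row from the printed values can differ from the printed counterfactual by 0.01; for example, the grade two row gives 59.08 - 9.79 = 49.29 against a printed 49.30, presumably because the report subtracted unrounded values.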

Statistical Significance 

Two-tailed t-tests are used to assess the statistical significance of impact estimates, and an asterisk (*) 
denotes statistically significant estimates at the conventional 0.05 probability level. The 0.05 standard for 
statistical significance implies that if a true impact is zero, there is only a one-in-twenty chance that its 
estimate will be statistically significant. Statistical significance does not represent the size, meaning, or
importance of an impact estimate. It indicates only that an estimate of that size would be unlikely to arise
by chance if the true impact were zero. For example, a statistically significant impact estimate is not
necessarily policy relevant; it is simply large enough that it is unlikely to be due entirely to chance. This
could occur for a small impact estimate from a large sample, for which the actual size of the estimated
impact might not be deemed substantively meaningful, even though it was statistically significant.
Conversely, lack of statistical significance for an impact estimate does not mean that the impact being
estimated equals zero, only that the estimate cannot be distinguished from zero reliably. This could occur
for a large impact estimate from a small sample, for which the actual size of the estimated impact might be
substantively meaningful, although there is uncertainty about the estimate.

The Reading First Impact Study focuses on several different outcomes and subgroups, and therefore 
estimates numerous impacts. Each individual estimate has only a 5 percent chance of falsely indicating an 
impact’s statistical significance when there is no impact. However, the group of estimates together has a 
much greater chance of falsely indicating that some impacts are statistically significant, even if none are. 
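
To make the multiple-testing concern concrete, the sketch below computes the chance of at least one false positive among several tests. It assumes the tests are statistically independent, which is a simplification: the study's outcomes are likely correlated, so these figures are illustrative only.

```python
def familywise_false_positive_rate(alpha, n_tests):
    """Probability of at least one false positive among `n_tests` tests,
    each run at significance level `alpha`, when every true impact is zero.

    ASSUMPTION: the tests are independent of one another.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


rates = {k: familywise_false_positive_rate(0.05, k) for k in (1, 5, 10, 20)}
```

Under this assumption, a single test has a 5 percent false-positive rate, while ten tests together have roughly a 40 percent chance of producing at least one spuriously significant estimate.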

Given the study’s broad research questions, the number of impacts estimated was limited to the minimum 
possible to reduce the problem of “multiple hypotheses testing.” 17 As a further safeguard, composite 
hypothesis tests were used to assess the overall statistical significance for groups of impact estimates 
within the core outcome domains described in Exhibit 1.4: student reading performance, classroom 
reading instruction, student engagement with print, professional development in SBRI, amount of reading 
instruction, supports for struggling readers, and use of assessments. These composite tests measure the 
statistical significance of impact estimates that are pooled across outcome measures, subgroups, or both. 

A statistically significant composite test would suggest that some of its components are statistically 
significant. If the composite test is not statistically significant, the statistically significant findings for its 
components might be due to chance. The composite tests therefore help to “qualify,” or call into question, 
statements that are based on individual findings. 18 



17 Researchers disagree about whether and how to account for multiple hypothesis testing (e.g., Gelman and Stern, 2006;
Schochet, 2008; Shaffer, 1995). 

18 See Appendix B for a detailed discussion of the study’s approach to multiple hypothesis testing. 








Roadmap to this Report 



Chapter Two addresses the study’s first two evaluation questions about impacts on instruction and on 
reading achievement for the study sites. Chapter Three presents the results of several exploratory 
analyses, pertaining to variation in impacts and relationships among instructional practices and student 
reading comprehension (in response to the study’s third research question). 








Chapter Two: Impact Findings 



This chapter addresses the study’s first two evaluation questions pertaining to Reading First impacts on 
classroom reading instructional practices and reading comprehension test scores. The core impact results 
are averaged across the study’s 18 sites and pooled across the 2004-05, 2005-06, and 2006-07 school 
years. The study pools estimates both to improve statistical power and to be more parsimonious with 
respect to findings. The differences in impacts among the three years are not statistically significant for 
data collected in all three years. (Appendix E presents impact estimates separately for each follow-up 
year.) In addition, the chapter presents Reading First impacts on measures administered in the spring of 
2007: a measure of students’ decoding skills administered to first graders and surveys administered to 
educational personnel. 19 As noted in Chapter One, all tables that display impact findings present values in 
the “Actual Mean with Reading First” column that are actual, unadjusted values for Reading First 
schools. The values in the “Estimated Mean without Reading First” column represent the best estimates of 
what would have happened in RF schools absent RF funding. Impact estimates are regression-adjusted to 
control for a linear specification of the rating variable used by sites to select Reading First schools. 
Estimates were obtained from multi-level statistical models that account for the clustering of students 
within classrooms, classrooms within schools, and schools within sites. 20 Impacts were estimated for each 
study site and then averaged across sites in proportion to their number of Reading First schools in the 
study sample. 
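
The cross-site averaging described above is a weighted mean, with each site weighted by its number of Reading First schools. The sketch below uses three made-up sites; the impact values and school counts are hypothetical and are not taken from the study.

```python
def pooled_impact(site_impacts, rf_school_counts):
    """Average site-level impacts, weighting each site by its number of
    Reading First schools in the study sample."""
    total_schools = sum(rf_school_counts)
    weighted_sum = sum(
        impact * count for impact, count in zip(site_impacts, rf_school_counts)
    )
    return weighted_sum / total_schools


# Hypothetical site-level impacts (minutes) and RF school counts.
site_impacts = [4.0, 10.0, 7.0]
rf_school_counts = [5, 10, 5]
pooled = pooled_impact(site_impacts, rf_school_counts)
```

With these illustrative inputs the pooled impact is 7.75, pulled toward the second site because it contributes half of the Reading First schools.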

Average Impacts on Reading Instruction 

Exhibits 2.1, 2.2, and 2.3 present estimated impacts on classroom reading instruction and student 
engagement with print. These estimates are based on data from classroom observations conducted in the 
18 study sites during the 2004-05, 2005-06, and 2006-07 school years. 

• Reading First produced a statistically significant positive impact on the total time that 
teachers spent on the five essential components of reading instruction promoted by the 
program. 

Exhibit 2.1 indicates that first- and second-grade teachers in Reading First schools spent 59 minutes, on 
average, during the approximately 112 minutes of the average daily reading block teaching phonemic 
awareness, phonics, vocabulary, fluency and/or comprehension. 21 This reflects a program impact of 6.9 
additional minutes per daily reading block in grade one and 9.8 additional minutes per daily reading block 
in grade two. Over the course of a week, this represents an additional 35 minutes for grade one and 49 
minutes for grade two. 
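
The weekly figures follow from multiplying the daily impact by the number of instructional days. The sketch below reproduces that arithmetic from the Exhibit 2.1 impacts, assuming a five-day school week.

```python
SCHOOL_DAYS_PER_WEEK = 5  # assumption: one reading block on each of five days


def weekly_minutes(daily_impact_minutes):
    """Additional minutes per week implied by a daily impact estimate."""
    return daily_impact_minutes * SCHOOL_DAYS_PER_WEEK


grade1_weekly = round(weekly_minutes(6.92))  # grade one impact, Exhibit 2.1
grade2_weekly = round(weekly_minutes(9.79))  # grade two impact, Exhibit 2.1
```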



19 Appendix D presents 95 percent confidence intervals for main impacts in relevant metrics as well as effect sizes. Confidence 
intervals for estimated impacts are reported for reading comprehension, decoding, instructional outcomes, and student 
engagement with print. 

20 See Appendix B for a discussion of the study’s approach to estimating impacts. 

21 The number of minutes of reading instruction used in impact analyses is based on observational data, which differs slightly 
from number of minutes reported on surveys. 



Final Report: Impact Findings 








Exhibit 2.1: Estimated Impacts on Instructional Outcomes: 2005, 2006, and 2007 (pooled) 



Construct                                         Actual Mean         Estimated Mean         Impact   Effect Size  Statistical Significance
                                                  With Reading First  Without Reading First           of Impact    of Impact (p-value)

Grade 1
Minutes of instruction in the five dimensions     59.23               52.31                  6.92*    0.33*        (0.005)
  combined
Percentage of intervals in five dimensions with   29.39               26.10                  3.29*    0.18*        (0.018)
  highly explicit instruction
Percentage of intervals in five dimensions with   18.44               17.61                  0.82     0.05         (0.513)
  High Quality Student Practice

Grade 2
Number of minutes of instruction in the five      59.08               49.30                  9.79*    0.46*        (<0.001)
  dimensions combined
Percentage of intervals in five dimensions with   30.95               27.95                  3.00*    0.16*        (0.040)
  highly explicit instruction
Percentage of intervals in five dimensions with   17.82               14.88                  2.94*    0.16*        (0.019)
  High Quality Student Practice



NOTES: 

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13 
states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools pooled across the spring 2005, fall 2005, and spring 2006 IPRI data (by grade). 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean amount of time spent per daily reading block in instruction in the five
dimensions combined for first grade classrooms with Reading First was 59.23 minutes. The estimated mean amount of
time without Reading First was 52.31 minutes. The impact of Reading First on the amount of time spent in instruction
in the five dimensions combined was 6.92 minutes (or 0.33 standard deviations), which was statistically significant (p=.005).

SOURCES: RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 
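
The effect size defined in the exhibit notes is the impact divided by the standard deviation of the outcome in non-Reading First schools. The report does not print that standard deviation here, so the sketch below back-calculates an approximate value from the rounded grade one figures; the result is a rough illustration, not a reported statistic.

```python
def effect_size(impact, sd_non_rf):
    """Impact expressed in standard deviations of the non-RF outcome."""
    return impact / sd_non_rf


def implied_sd(impact, reported_effect_size):
    """Approximate non-RF standard deviation implied by a rounded impact and
    a rounded effect size (back-calculation, so only approximate)."""
    return impact / reported_effect_size


# Grade one, minutes in the five dimensions combined (Exhibit 2.1).
sd_grade1 = implied_sd(6.92, 0.33)
```

The implied standard deviation is about 21 minutes, which gives a sense of scale: the 6.92-minute impact is roughly one third of the typical classroom-to-classroom spread in non-Reading First schools.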



• Reading First produced a statistically significant positive impact on the use of highly explicit 
instruction in grades one and two, and a statistically significant increase in the amount of high 
quality student practice in grade two. Its estimated impact on high quality student practice for 
grade one was not statistically significant. 

For first-grade classrooms in Reading First schools, 29 percent of the observation intervals with 
instruction in the five dimensions also involved highly explicit instruction (active teaching, modeling or 
explaining concepts, and helping children to use reading strategies). This average was 31 percent for
second-grade classrooms. These findings represent a program impact of 3.29 percentage points for first 
grade and 3.00 percentage points for second grade. 

For first-grade and second-grade classrooms in Reading First schools, approximately 18 percent of the
observation intervals that included instruction in the five dimensions also involved high quality student
practice (component-specific opportunities for students to practice their skills). These findings represent a
program impact of 2.94 percentage points for second grade and 0.82 percentage points for first grade
(which was not statistically significant).

Exhibit 2.2: Estimated Impacts On the Number of Minutes in Instruction in Each of the Five
Dimensions of Reading: 2005, 2006, and 2007 (pooled)

Number of minutes of instruction in:    Actual Mean         Estimated Mean         Impact   Effect Size  Statistical Significance
                                        With Reading First  Without Reading First           of Impact    of Impact (p-value)

Grade 1
  Phonemic Awareness                    2.32                1.71                   0.61*    0.23*        (0.030)
  Phonics                               21.32               18.45                  2.86*    0.21*        (0.048)
  Vocabulary                            7.92                7.35                   0.57     0.09         (0.386)
  Fluency                               4.67                3.43                   1.24*    0.20*        (0.043)
  Comprehension                         23.01               21.23                  1.78     0.12         (0.247)

Grade 2
  Phonemic Awareness                    0.49                0.37                   0.12     0.10         (0.319)
  Phonics                               13.92               10.65                  3.27*    0.31*        (0.006)
  Vocabulary                            11.79               10.06                  1.73*    0.20*        (0.036)
  Fluency                               4.14                3.56                   0.58     0.11         (0.297)
  Comprehension                         28.74               24.73                  4.01*    0.24*        (0.019)

NOTES:

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13
states. 125 schools are Reading First schools and 123 are non-Reading First schools.

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First
Schools pooled across the spring 2005, fall 2005, and spring 2006 IPRI data (by grade).

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in
the “Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF
schools absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values.

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *.

EXHIBIT READS: The observed mean amount of time spent per daily reading block in instruction in phonemic
awareness for first grade classrooms with Reading First was 2.32 minutes. The estimated mean amount of time
without Reading First was 1.71 minutes. The impact of Reading First on the amount of time spent in instruction in
phonemic awareness was 0.61 minutes (or 0.23 standard deviations), which was statistically significant (p=.030).

SOURCES: RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006 and spring
2007

Exhibit 2.3: Estimated Impacts on the Percentage of Students Engaged with Print: 2006 and 2007

Construct                                         Actual Mean         Estimated Mean         Impact   Effect Size  Statistical Significance
                                                  with Reading First  without Reading First           of Impact    of Impact (p-value)

Grade 1
Percentage of students engaged with print
  Pooled (SY 2006, SY 2007)                       47.84               42.52                  5.33     0.18         (0.070)

Grade 2
Percentage of students engaged with print
  Pooled (SY 2006, SY 2007)                       50.53               55.27                  -4.75    -0.17        (0.104)

NOTES:

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 districts and 1 state) located
in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools.

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading
First Schools pooled across the fall 2005 and spring 2006 STEP data (by grade).

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values
in the “Estimated Mean without Reading First” column represent the best estimates of what would have happened in
RF schools absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean
values.

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *.

EXHIBIT READS: For the 2006 and 2007 school years pooled, the actual average percentage of students
engaged with print in first grade classrooms with Reading First was 47.84 percent. The estimated average
percentage without Reading First was 42.52 percent. The impact of Reading First on the average percentage of
student engagement with print was 5.33 percentage points (or 0.18 standard deviations), which was not
statistically significant (p=.070).

SOURCE: RFIS Student Time-on-Task and Engagement with Print, fall 2005, spring 2006, fall 2006, and spring 2007

A composite test of the six impact estimates in Exhibit 2.1 was conducted by combining its three
measures into one index and pooling the data for grades one and two (see Appendix B, Exhibit B.7).
This test indicates a statistically significant overall impact of Reading First on instructional practice.

Exhibit 2.2 presents separate impact estimates for each of the five Reading First instructional dimensions, 
illustrating the relative emphasis placed by Reading First schools on each dimension, how this emphasis 
differs by grade, and how Reading First impacts are distributed across the dimensions. The majority of 
Reading First instructional time focused on comprehension and phonics, and half of the program’s 
statistically significant instructional impacts were on these two dimensions. 

• First grade teachers in Reading First schools spent about 21.3 minutes on phonics and 23.0 
minutes on comprehension per daily reading block. This reflects an estimated daily impact of 
2.9 additional minutes for phonics (statistically significant) and 1.8 additional minutes for 
comprehension (not statistically significant). Although first grade teachers in Reading First 
schools spent relatively little time on phonemic awareness (an average of 2.3 minutes per
reading block) and fluency (4.7 minutes), program impacts on these dimensions were positive
and statistically significant. 

• Second grade teachers in Reading First schools spent 13.9 minutes on phonics and 28.7 
minutes on comprehension per daily reading block. This reflects statistically significant 
impacts of 3.3 minutes for phonics and 4.0 minutes for comprehension. Reading First also 
produced a statistically significant impact on vocabulary instruction of 1.7 minutes per daily 
reading block. 

Average Impacts on Student Engagement with Print 

Exhibit 2.3 presents estimated impacts on the percentage of students engaged with print during 
observations of reading instruction within the reading block. The measure of student engagement with 
print used in impact analyses is the per-classroom average of the percentage of students engaged with 
print across three observation sweeps in each classroom. 

Approximately 48 percent of first grade students and 51 percent of second grade students in Reading First 
schools were engaged with print during observations of reading instruction within the reading block. The 
estimated impact on student engagement with print was not statistically significant for grade one (5.33 
percentage points) or grade two (-4.75 percentage points). 

Exhibit 2.3 includes two statistical tests of program impacts on the percentage of students engaged with 
print, one for each grade. A composite test was conducted that pools findings across grades; it was not 
statistically significant. (See Appendix B, Exhibit B.7). 

Average Impacts on Key Components of SBRI 

The section below draws from self-reported survey data collected at both the school level (surveys of
principals and reading coaches) and the classroom level (teacher surveys) 22 to assess the extent to which 
components of scientifically based reading instruction (SBRI) have been implemented in study schools. 
Data on such school and classroom level practices can provide information about the levels of these 
practices and whether Reading First has had an impact on them. 

Exhibit 2.4 lists eight outcome measures that represent four domains — professional development in SBRI, 
amount of reading instruction, supports for struggling readers, and use of assessments. Two outcome 
measures are at the school-level and six outcome measures are at the classroom-level. 23 For each measure, 
RDD estimation methods were used to determine if statistically significant differences exist between the 
treatment and comparison groups. 



22 This section reports on 2007 survey findings only. 

23 See Appendix C for a detailed description of the eight survey outcome variables, including the survey items, the item metrics, 
the outcome specifications, and the internal consistency reliability (as applicable). 
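
For readers unfamiliar with regression discontinuity design (RDD) estimation, the sketch below shows the core idea in its simplest form: regress the outcome on an intercept, the rating variable, and a treatment indicator, and read the impact off the treatment coefficient. It is a deliberately stripped-down, single-site ordinary least squares illustration on synthetic, noise-free data. The study's actual models are multi-level, use site-specific ratings, and include background covariates (see Appendix B); the cutoff at zero and the direction of assignment are assumptions made here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rdd_impact(ratings, treated, outcomes):
    """OLS of outcome on intercept, rating (linear), and treatment indicator.

    Returns the treatment coefficient, i.e. the estimated impact at the cutoff.
    """
    rows = [[1.0, r, float(t)] for r, t in zip(ratings, treated)]
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, outcomes)) for i in range(3)]
    return solve(xtx, xty)[2]


# Synthetic schools: rating centered on a cutoff of 0.
# ASSUMPTION: schools at or above the cutoff receive funding.
ratings = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
treated = [r >= 0 for r in ratings]
outcomes = [50.0 + 2.0 * r + 7.0 * t for r, t in zip(ratings, treated)]

impact = rdd_impact(ratings, treated, outcomes)
```

Because the synthetic outcomes were built with a true impact of 7.0 and no noise, the regression recovers that value; with real data the estimate would carry sampling error and depend on the linear specification being adequate.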








Exhibit 2.4: Estimated Impacts on Key Components of Scientifically Based Reading Instruction 
(SBRI): Spring 2007 



Domain                                            Actual Mean         Estimated Mean         Impact   Effect Size  Statistical Significance
                                                  With Reading First  Without Reading First           of Impact    of Impact (p-value)

Professional Development (PD) in SBRI
Amount of PD in reading received by teachers      25.84               13.71                  12.13*   0.51*        (<0.001)
  (hours) a
Teacher receipt of PD in the five essential       4.30                3.75                   0.55*    0.31*        (0.010)
  components of reading instruction (0-5) a
Teacher receipt of coaching (proportion) a        0.83                0.63                   0.20*    0.41*        (<0.001)
Amount of time dedicated to serving as K-3        91.06               57.57                  33.49*   1.03*        (<0.001)
  reading coach (percent) b, c

Amount of Reading Instruction
Minutes of reading instruction per day a          105.71              87.24                  18.47*   0.63*        (<0.001)

Supports for Struggling Readers
Availability of differentiated instructional      0.98                0.97                   0.01     0.15         (0.661)
  materials for struggling readers (proportion) b
Provision of extra classroom practice for         3.79                3.59                   0.19*    0.20*        (0.018)
  struggling readers (0-4) a

Use of Assessments
Use of assessments to inform classroom            2.63                2.45                   0.18     0.19         (0.090)
  practice (0-3) a



NOTES: 

a Classroom level outcome 
b School level outcome 

c The response rates for RF and non-RF reading coach surveys were statistically significantly different (p=0.037). Reading
First schools were more likely to have had reading coaches and to have returned reading coach surveys.

d Missing data rates ranged from 0.1 to 3.3 percent for teacher survey outcomes (RF: 0.1 to 1.0 percent; non-RF: 0 to 4.9
percent) and 1.3 to 2.8 percent for reading coach and/or principal survey outcomes (RF: 0 to 1.6 percent; non-RF: 2.7 to 4.1
percent). Survey constructs (i.e., those outcomes comprised of more than one survey item) were computed only for 
observations with complete data, with one qualification: for the construct “minutes spent on reading instruction per day,” the 
mean was calculated as the total number of minutes reported for last week (over a maximum of 5 days) divided by the 
number of days with non-missing values. Less than one percent of teachers (0.9 percent) were missing data for all 5 days. 

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13 
states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean amount of professional development in reading received by teachers with 
Reading First was 25.84 hours. The estimated mean amount of professional development in reading received by 
teachers without Reading First was 13.71 hours. This impact of 12.13 hours was statistically significant (p<.001).

SOURCES: RFIS, Teacher, Reading Coach, and Principal Surveys, spring 2007 
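
Note d describes how the "minutes spent on reading instruction per day" construct handles missing days: the minutes reported for last week are summed and divided by the number of days with a reported value. The following is a minimal Python sketch of that rule, not the study's actual code. The function name, the use of None to mark a missing day, and the example values are assumptions made for illustration.

```python
from typing import Optional, Sequence


def mean_minutes_per_day(daily_minutes: Sequence[Optional[float]]) -> Optional[float]:
    """Average reported minutes over the days that have a value.

    `daily_minutes` holds up to 5 entries (one per school day last week).
    None marks a day with missing data (an assumed encoding).
    """
    if len(daily_minutes) > 5:
        raise ValueError("at most 5 school days are expected")
    reported = [m for m in daily_minutes if m is not None]
    if not reported:
        # All days missing: the construct cannot be computed for this teacher.
        return None
    return sum(reported) / len(reported)


# Hypothetical teacher who reported four of the five days.
example = mean_minutes_per_day([90, 120, None, 100, 110])
print(example)  # 105.0
```

A teacher with all five days missing yields None here, which corresponds to the small share of teachers the note says were missing data for all 5 days.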










Exhibit 2.4 indicates that Reading First had a significant impact on the amount, content, and type of 
professional development received by teachers in grades one through three, according to teacher and 
reading coach self-reports. More specifically, there were statistically significant impacts on all four 
outcome measures in the domain of professional development in SBRI: 

• Reading First had a statistically significant impact on the amount of professional 
development in reading teachers reported receiving; this impact was 12.1 hours. 

• Reading First had a statistically significant impact on teachers’ self-reported receipt of 
professional development in the five essential components of reading instruction. Teachers in 
RF schools reported receiving professional development in an average of 4.3 components, 0.6 
components more than would be expected without Reading First (3.7 components). 

• A statistically significantly greater proportion of teachers in RF schools (83 percent) reported
receiving coaching from a reading coach than would be expected without Reading First (63 percent),
a difference of 20 percentage points.

• Reading First had a statistically significant impact on the amount of time reading coaches 
reported spending in their role as the school’s reading coach. Reading coaches in RF schools 
reported spending 91.1 percent of their time in this role, 33.5 percentage points more than 
would be expected without Reading First (57.6 percent). 

Reading First had a statistically significant impact on the amount of time teachers reported spending on 
reading instruction per day. Teachers in RF schools reported an average of 105.7 minutes per day, 18.5 
minutes more than would be expected without Reading First (87.2 minutes). 

Reading First had mixed impacts on the availability of supports for struggling readers. 

• Reading First had a statistically significant impact on teachers’ provision of extra classroom 
practice in the essential components of reading instruction in the past month; the estimated 
impact was 0.2 components. 

• There was no statistically significant impact of Reading First on the availability of 
differentiated instructional materials for struggling readers. 

There was no statistically significant impact of Reading First on the teachers’ reported use of assessments 
to inform classroom practice for grouping, diagnostic, and progress monitoring purposes. 

To assess the overall impact of Reading First on these survey items, two composite tests were conducted. 
The first composite test combined the two outcome measures from the reading coach and/or principal 
survey data into a single school-level index; the second composite test combined the six outcome 
measures from the teacher survey data into a single classroom-level index (See Appendix B, Exhibit B.7). 
These tests indicate a statistically significant overall impact of Reading First on the implementation of 
scientifically based reading instruction both at the school-level and the classroom-level. 

In conclusion, estimated impacts based on survey data from RF and non-RF schools in the study sample 
indicate that statistically significant impacts of Reading First are evident in six of the eight outcome 
measures, including the four outcome measures in the professional development in SBRI domain, the 
single outcome measure in the amount of reading instruction domain, and one of two outcome measures 
in the supports available for struggling readers domain. There was no statistically significant impact of 








RF in the use of assessments domain. These data indicate that RF schools are consistently reporting 
higher levels of implementation of SBRI practices than would have occurred absent RF.

Average Impacts on Reading Achievement 

Average Impacts on Reading Comprehension 

Exhibit 2.5 presents estimated Reading First impacts on student reading comprehension scores on the 
SAT 10. These findings reflect impact estimates that are averaged across the 18 study sites and pooled 
across the three study follow-up years (2004-2005, 2005-2006, and 2006-2007). Impact estimates are 
regression-adjusted to control for a linear specification of the rating variable used by sites to select 
Reading First schools and for selected school and student background characteristics. Estimates were 
obtained from multi-level statistical models that account for the clustering of students within classrooms, 
classrooms within schools, and schools within sites. Impacts were estimated for each study site and then 
averaged across sites in proportion to their number of Reading First schools in the study sample. 
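
The estimation steps described above (a linear adjustment for the rating variable within each site, followed by a site average weighted by the number of Reading First schools) can be illustrated with a simplified Python sketch. This is not the study's multi-level model: it omits the background covariates and the clustering of students, classrooms, and schools, and it assumes that schools at or above a site's cut-point were funded. All data, site names, and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def _residualize(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Residuals from a simple OLS regression of y on x (with intercept)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def site_impact(ratings: Sequence[float], treated: Sequence[float],
                outcomes: Sequence[float]) -> float:
    """Coefficient on `treated` in outcome ~ 1 + rating + treated.

    Computed by partialling the rating out of both the treatment indicator
    and the outcome (the Frisch-Waugh-Lovell identity for OLS).
    """
    res_t = _residualize(treated, ratings)
    res_y = _residualize(outcomes, ratings)
    return sum(t * y for t, y in zip(res_t, res_y)) / sum(t * t for t in res_t)


def pooled_impact(impacts: Dict[str, float], rf_schools: Dict[str, int]) -> float:
    """Average site impacts in proportion to each site's number of RF schools."""
    total = sum(rf_schools[site] for site in impacts)
    return sum(impacts[site] * rf_schools[site] for site in impacts) / total


def make_site(ratings: List[float], base: float, slope: float, effect: float):
    """Hypothetical noiseless site; ratings are centered at the cut-point."""
    treated = [1 if r >= 0 else 0 for r in ratings]  # assumed funding rule
    outcomes = [base + slope * r + effect * t for r, t in zip(ratings, treated)]
    return ratings, treated, outcomes


impacts = {
    "site_A": site_impact(*make_site([-3, -2, -1, 1, 2, 3], 500.0, 2.0, 6.0)),
    "site_B": site_impact(*make_site([-2, -1, 1, 2], 520.0, 1.0, 2.0)),
}
rf_schools = {"site_A": 3, "site_B": 2}
print(round(pooled_impact(impacts, rf_schools), 2))  # 4.4
```

Because the illustrative data contain no noise, each site regression recovers the effect that was built in (6 and 2), and the pooled value is (6 x 3 + 2 x 2) / 5 = 4.4.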

• Impacts on student reading comprehension test scores were not statistically significant. 

Estimated impacts were not statistically significant for grade one (4.7 scaled score points or an effect size 
of 0.10 standard deviations), grade two (1.7 scaled score points or an effect size of 0.04 standard 
deviations), or grade three (0.3 scaled score points or an effect size of 0.01 standard deviations). 24 The 
average first, second, and third grade student in Reading First schools was reading at the 44th, 39th, and
39th percentile, respectively, on the end-of-the-year assessment (on average over the three years of data
collection). 

Exhibit 2.5 includes six statistical tests of program impacts on reading comprehension — one for each 
combination of grade and reading comprehension measure. A composite test of these estimates using an 
index that combines measures and pools the sample across grades was not statistically significant. (See 
Appendix B, Exhibit B.7). 25 

Average Impacts on Decoding Skills for Students in Grade One in Spring 2007 

For the final year of data collection, first grade students were also assessed with the Test of Silent Word 
Reading Fluency (TOSWRF, Mather et al., 2004). The TOSWRF is a short three-minute assessment that 
measures students’ ability to identify words quickly and correctly. This assessment was added to explore 
whether Reading First has an impact on decoding skills, another of the five components of reading skill 
targeted by Reading First (along with comprehension, vocabulary, phonemic awareness, and fluency). 
The assessment was added in the last year of the study’s data collection, which means that the TOSWRF 



24 The study also examined third grade reading achievement scores on state-required assessments for the core sample for 2006 
scores only (excluding one site that had no third grade assessment and another site that did not use a percent proficient metric). 
These results are shown in Appendix E, Part 3. The results are consistent with the Grade Three results for the SAT 10. 

25 For technical reasons, the index used in the composite test for student reading performance includes only the two SAT 10
measures for which data are available across grades. The TOSWRF could not be included in the index because data were only 
available for one grade. 








Exhibit 2.5: Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007 (Pooled) 





                                                    Actual    Estimated
                                                    Mean      Mean                         Statistical
                                                    with      without             Effect   Significance
                                                    Reading   Reading             Size of  of Impact
Construct                                           First     First      Impact   Impact   (p-value)

Panel 1
All Sites
Reading Comprehension Scaled Score
  Grade 1
    Scaled Score                                    543.8     539.1      4.7      0.10     (0.083)
    Corresponding Grade Equivalent a                1.7       1.7
    Corresponding Percentile                        44        41
  Grade 2
    Scaled Score                                    584.4     582.8      1.7      0.04     (0.462)
    Corresponding Grade Equivalent a                2.5       2.4
    Corresponding Percentile                        39        38
  Grade 3
    Scaled Score                                    609.1     608.8      0.3      0.01     (0.887)
    Corresponding Grade Equivalent a                3.3       3.3
    Corresponding Percentile                        39        39

Panel 2
All Sites
Percent Reading At or Above Grade Level b
  Grade 1                                           46.0      41.8       4.2               (0.104)
  Grade 2                                           38.9      37.3       1.6               (0.504)
  Grade 3                                           38.7      38.8       -0.1              (0.973)



NOTES: 

The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 school districts and 1 state) located in 13 
states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one non-RF school 
could not be included in the analysis because test score data were not available. For grade 3 in 2007, one RF school could not be 
included in the analysis because test score data were not available. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
schools pooled across the spring 2005 and 2006 SAT 10 test scores (by grade). 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools absent 
RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

a Grade equivalent scores are based on a nine-month school year, are reported in decimal format (year.month), and provide an 
estimate of the performance that an average student at a grade level is assumed to demonstrate on the test at a particular month in 
the school year. For example, a score of 1.7 represents a performance level typical of a first grade student in the seventh month of 
the school year. 

b The “at or above grade level” variable is dichotomous, therefore effect sizes are not appropriate. 

EXHIBIT READS: The observed mean reading comprehension score for first-graders with Reading First was 543.8 scaled 
score points. The estimated mean without Reading First was 539.1 scaled score points. The impact of Reading First was 
4.7 scaled score points (or 0.10 standard deviations), which was not statistically significant (p=.083). The observed average 
percent of first-graders reading at or above grade level with Reading First was 46.0 percent. The estimated
average percent without Reading First was 41.8 percent. The impact of Reading First on the percent of first
grade students reading at or above grade level was 4.2 percentage points, which was not statistically significant (p=.104). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education agencies in 
those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR). 
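
The exhibit notes define two derived quantities: the estimated mean without Reading First (the actual RF mean minus the impact) and the effect size (the impact divided by the standard deviation for non-RF schools). The short Python sketch below works through that arithmetic for the grade 1 scaled score row. The function names are illustrative, and the standard deviation of 47.0 is a hypothetical value chosen only for the example, since the exhibit does not report it.

```python
def estimated_mean_without_rf(actual_mean_with_rf: float, impact: float) -> float:
    """Counterfactual mean: actual RF mean minus the estimated impact."""
    return actual_mean_with_rf - impact


def effect_size(impact: float, comparison_sd: float) -> float:
    """Impact expressed in standard deviations of the non-RF outcome."""
    return impact / comparison_sd


# Grade 1 scaled score values taken from Exhibit 2.5.
counterfactual = estimated_mean_without_rf(543.8, 4.7)
print(round(counterfactual, 1))  # 539.1

# Hypothetical non-RF standard deviation (not reported in the exhibit).
hypothetical_sd = 47.0
print(round(effect_size(4.7, hypothetical_sd), 2))  # 0.1
```

Read in the other direction, a reported impact of 4.7 points and an effect size of 0.10 imply a non-RF standard deviation of roughly 47 scaled score points, subject to rounding in both reported figures.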










was administered to first grade students only once in the spring of 2007. Thus, unlike the reading 
comprehension impact estimates, which are available for grades one, two, and three, and pooled across 
three school years, the decoding results reflect only one of the three follow up years of data collection and 
are available for only grade one. 

Exhibit 2.6 summarizes findings from an analysis of Reading First’s impact on TOSWRF scores for first 
grade students in spring 2007. 

• Reading First produced a statistically significant positive impact on TOSWRF scores of 2.5 
standard score points, equal to an effect size of 0.17 standard deviations. 



Exhibit 2.6: Estimated Impacts of Reading First on Decoding Skill: Grade One, Spring 2007 





                                                    Actual    Estimated
                                                    Mean      Mean                         Statistical
                                                    with      without             Effect   Significance
                                                    Reading   Reading             Size of  of Impact
                                                    First     First      Impact   Impact   (p-value)

Decoding Skill
  Standard Score                                    96.9      94.4       2.5*     0.17*    (0.025)
  Corresponding Grade Equivalent a                  1.7       1.4
  Corresponding Percentile                          42        35









NOTES: 

The Test of Silent Word Reading Fluency (TOSWRF) sample includes first-graders in 248 schools from 18 sites (17 school 
districts and 1 state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools from spring 2007 TOSWRF test scores (1st grade).

The key metric for the TOSWRF analyses is the standard score; corresponding grade equivalents and percentiles are provided
for reference. Although the publisher of the Test of Silent Word Reading Fluency states that straight comparisons between 
standard scores and grade equivalents will likely yield discrepancies due to the unreliability of the grade equivalents, they are 
provided because program criteria are sometimes based on grade equivalents. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

a Grade equivalent scores are based on a nine-month school year, are reported in decimal format (year.month), and provide an 
estimate of the performance that an average student at a grade level is assumed to demonstrate on the test at a particular month 
in the school year. For example, a score of 1.7 represents a performance level typical of a first grade student in the seventh
month of the school year. 

EXHIBIT READS: The observed mean silent word reading fluency standard score for first-graders with Reading First 
was 96.9 standard score points. The estimated mean without Reading First was 94.4 standard score points. The impact 
of Reading First was 2.5 standard score points (or 0.17 standard deviations), which was statistically significant 
(p=.025). 

SOURCES: RFIS TOSWRF administration in spring 2007 
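
Note a explains that grade equivalents use a year.month format. As a small illustration of how to read such a value, the Python sketch below splits a grade equivalent into its grade and month parts. The function name is hypothetical, and the score is passed as text so that the digit after the decimal point is read as a month rather than as a fraction.

```python
def describe_grade_equivalent(score: str) -> str:
    """Read a 'year.month' grade equivalent such as '1.7'."""
    grade_text, month_text = score.split(".")
    grade, month = int(grade_text), int(month_text)
    if not 0 <= month <= 9:
        raise ValueError("the month part must be a single digit from 0 to 9")
    return f"grade {grade}, month {month} of the school year"


# Values from Exhibit 2.6: actual mean with RF and estimated mean without RF.
print(describe_grade_equivalent("1.7"))  # grade 1, month 7 of the school year
print(describe_grade_equivalent("1.4"))  # grade 1, month 4 of the school year
```

As the notes caution, the publisher considers grade equivalents less reliable than standard scores, so the standard score remains the key metric.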












Summary 



The findings presented in this chapter are generally consistent with findings presented in the study’s 
Interim Report, which found statistically significant impacts on instructional time spent on the five 
essential components of reading instruction promoted by the program (phonemic awareness, phonics, 
vocabulary, fluency, and comprehension) in grades one and two, and which found no statistically 
significant impact on reading comprehension as measured by the SAT 10. In addition to data on the
instructional and student achievement outcomes reported in the Interim Report, the final report also 
presents findings based upon information obtained during the study’s third year of data collection: data 
from a measure of first grade students’ decoding skill and data from self-reported surveys of educational 
personnel in study schools. 

The additional data sources provide more information about the contexts within which the Reading First 
program has operated. The information obtained from the Test of Silent Word Reading Fluency indicates 
that Reading First had a positive and statistically significant impact on first grade students’ decoding skill. 
Through surveys, Reading First school personnel reported implementing the key programmatic 
components outlined in the enabling legislation. 

A frequent criticism of the interim report was that the scientifically based reading practices promoted by 
Reading First have been diffused to non-Reading First schools, thus diluting the impact of Reading First 
(see, for example, the response to the Interim Report by the Reading First Federal Advisory Committee, 
2008). States could reserve up to 20 percent of their Reading First funds to support staff development, 
technical assistance to districts and schools, and planning, administration, and reporting. According to the 
program guidance, this funding provided “states with the resources and opportunity. . .to improve 
instruction beyond the specific districts and schools that receive Reading First subgrants” (U.S. 
Department of Education, 2002). 

The results from both observational and survey data indicate that Reading First produced statistically 
significant impacts on instruction and reading program implementation. These differences are inconsistent
with the view that the treatment had diffused, insofar as diffusion means that practices were the
same in RF and non-RF schools. However, there are no data available on reading practices in study
schools prior to Reading First implementation. Thus, the study cannot provide a definitive statement as to 
the presence or absence of diffusion. 








Chapter Three: Exploratory Analyses of Variations in 
Impacts and Relationships among Outcomes 



The Reading First Impact Study was designed to test the impact of the receipt of Reading First funds at 
the school level. The study was conducted in 248 schools located in 18 sites in 13 states. The study 
focused on student reading achievement, as well as teachers’ classroom reading practices. Analyses of 
impact were conducted for data collected during three school years (2004-05, 2005-06, and 2006-07), 
representing between one and four years of program implementation, depending on the site. 

The results reported in Chapter Two indicate that the receipt of Reading First funding at the school level 
produced an impact on the amount of time teachers spent on the five components of reading instruction 
promoted by the program and on first graders’ decoding skills, but not on student reading comprehension. 
The sections below describe exploratory analyses that examine some hypotheses about factors that might 
account for the observed pattern of impacts. The results are based on analyses of subgroups of students, 
schools, grade levels, and/or years of data collection. The information provides possible avenues for 
further exploration or for improving Reading First or programs like Reading First. Because the study was 
not designed to provide a rigorous test of the hypotheses explored in this chapter, the results are only 
suggestive. The methodological literature about subgroup analyses highlights the importance of 
specifying hypotheses in advance, limiting the number of additional tests, and interpreting results with 
considerable caution. (See, for example, Hernandez, Boersma, Murray, Steyerberg, 2006; Rothwell, 2005; 
Wang, R., Lagakos, S.W., Ware, J.H., Hunter, D.J., & Drazen, J.M., 2007).

The first section of this chapter examines variation in impacts. The second section examines the 
relationship between classroom reading instruction and student achievement. 

Variation in Impacts 

The core impact analyses reported in Chapter Two are average impacts, meant to represent the impact for 
the average Reading First school in the sample. It is reasonable to wonder whether these overall averages 
might be masking differences in impacts that could be attributed to variation in: 1) time of RF 
implementation; 2) student exposure to RF; or 3) sites. The following section explores these hypotheses. 

Variation in Impacts Over Time 

This section explores the question of whether the impact estimates presented in Chapter Two — which are 
pooled across three school years — may be masking changes in impacts over time. 26 

Three approaches were used to address the question of possible changes in impacts over time. First, we 
examined estimated impacts on instructional and reading comprehension outcomes for each year of the 
study (and pooled) at a given grade level. Next, we conducted two types of statistical tests. The first test, 
which is a more restrictive test, assessed whether there was a linear trend (year-to-year change) of impacts 



26 Additional analyses of student achievement trends for the RFIS study sample, including patterns of mean SAT 10 scores in
grades one through three and state-mandated reading assessments in grade three, are presented in Appendix E. 



Final Report: Exploratory Analyses 



29 





over time for successive cohorts of first, second, and third graders (if applicable). The second test, a 
global F-test, assessed whether there was any overall variation in the impacts over the study years for a 
given grade level. If inconsistencies in statistical significance were found between these two tests, then 
the results of either test were interpreted with caution. 

For instructional time in the five dimensions combined, Exhibit 3.1 indicates that when impacts are
estimated separately for each grade and year, those impacts decrease over time for each grade. 27 For
example, for minutes of instruction in the five dimensions combined in grade one, the impact was 8.89
minutes per reading block in spring 2005, 8.71 minutes in school year 2006, 5.92 minutes in school year
2007, and 6.92 minutes for all years pooled. The first statistical test of a linear time trend for these
impacts suggests a statistically significant annual decline in impacts on time in the five dimensions of 2.6 
minutes per daily reading block for grade one and 2.9 minutes per daily reading block for grade two 
(Exhibit 3.2). However, the second global F-test for each grade of the null hypothesis of no variation 
across three years suggests that the variation for grade one was not statistically significant while the 
variation for grade two was statistically significant (Exhibit 3.2). Thus, readers should be particularly 
cautious when inferring a systematic pattern of decline in impacts on time in the five dimensions for first 
grade. At the same time, it does appear that the decline in impacts on time in the five dimensions for 
second grade was more systematic. 

Findings for reading comprehension scores, estimated separately for each grade and year, suggest that 
impacts increased over time for each grade (Exhibit 3.3). For example, in Grade One, the impact was 2.2 
scaled score points in Spring 2005, 5.3 scaled score points in Spring 2006, 7.5 scaled score points in 
Spring 2007, and 4.7 scaled score points for all years pooled. The first statistical test of a linear trend for 
impacts suggests that only for grade three was there a statistically significant increase. Estimates of a 
linear impact trend for all three grades pooled indicate a statistically significant increase of 2.5 scaled 
score points per year (Exhibit 3.2). However, the global F-test of the null hypothesis of no variation 
across three years was not statistically significant for any grade (Exhibit 3.2). Thus, readers should be 
cautious about inferring a systematic pattern of increasing impacts over time on reading comprehension. 

In sum, these analyses do not provide conclusive support for the hypothesis that the core impact estimates 
presented in Chapter Two are masking variation in impacts over time in either reading instruction in grade 
one or in student reading comprehension in grades one, two or three. For reading instruction, there 
appears to be a systematic decline in impacts in grade two. 

Variation in Impacts on Reading Comprehension Associated with Student 
Exposure to Reading First Schools 

Reading First is intended to provide students with a complete instructional program from kindergarten 
through third grade. However, because of student mobility and the coincident timing of both the start of 
the program and of the study, many students in the study sample may not have experienced the fullest 
exposure possible (four full school years, K through 3) to Reading First instructional practices and 
support services. For example, in the group of study sites that began implementing RF in 2004-2005, third 



27 These same analyses were also conducted for each dimension separately (phonemic awareness, phonics, vocabulary, fluency, 
and comprehension) and results are presented in Appendix E, Exhibits E.1 and E.2. Results of these analyses for the STEP are
also presented in Appendix E, Exhibit E.3. 








Exhibit 3.1: Estimated Impacts on Instructional Outcomes: 2005, 2006, and 2007, and Pooled 



                                                    Actual    Estimated
                                                    Mean      Mean                         Statistical
                                                    With      Without             Effect   Significance
                                                    Reading   Reading             Size of  of Impact
Construct                                           First     First      Impact   Impact   (p-value)

Grade 1
  Minutes of instruction in the five dimensions combined
    Spring 2005                                     59.23     50.34      8.89*    0.43*    (0.007)
    School year 2006                                59.49     50.78      8.71*    0.42*    (0.010)
    School year 2007                                58.93     53.00      5.92     0.28     (0.050)
    Pooled 3 years (Sp05, Sy06, Sy07)               59.23     52.31      6.92*    0.33*    (0.005)
  Percentage of intervals in five dimensions with highly explicit instruction
    Spring 2005                                     29.71     22.38      7.33*    0.41*    (0.003)
    School year 2006                                29.76     27.90      1.86     0.10     (0.326)
    School year 2007                                28.73     25.90      2.83     0.16     (0.169)
    Pooled 3 years (Sp05, Sy06, Sy07)               29.39     26.10      3.29*    0.18*    (0.018)
  Percentage of intervals in five dimensions with High Quality Student Practice
    Spring 2005                                     21.31     22.05      -0.74    -0.04    (0.749)
    School year 2006                                17.99     16.25      1.75     0.10     (0.295)
    School year 2007                                17.24     15.55      1.69     0.10     (0.300)
    Pooled 3 years (Sp05, Sy06, Sy07)               18.44     17.61      0.82     0.05     (0.513)

Grade 2
  Minutes of instruction in the five dimensions combined
    Spring 2005                                     58.33     45.25      13.07*   0.62*    (<0.001)
    School year 2006                                60.14     49.30      10.84*   0.51*    (0.001)
    School year 2007                                58.57     52.06      6.51*    0.31*    (0.029)
    Pooled 3 years (Sp05, Sy06, Sy07)               59.08     49.30      9.79*    0.46*    (<0.001)
  Percentage of intervals in five dimensions with highly explicit instruction
    Spring 2005                                     32.02     25.15      6.86*    0.36*    (0.008)
    School year 2006                                31.33     24.38      6.95*    0.36*    (0.001)
    School year 2007                                30.02     31.97      -1.95    -0.10    (0.309)
    Pooled 3 years (Sp05, Sy06, Sy07)               30.95     27.95      3.00*    0.16*    (0.040)
  Percentage of intervals in five dimensions with High Quality Student Practice
    Spring 2005                                     22.86     18.96      3.90     0.22     (0.083)
    School year 2006                                16.40     13.04      3.35*    0.19*    (0.043)
    School year 2007                                16.40     14.24      2.16     0.12     (0.212)
    Pooled 3 years (Sp05, Sy06, Sy07)               17.82     14.88      2.94*    0.16*    (0.019)



NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 districts and 1 state) located in 
13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools pooled across the spring 2005, fall 2005, and spring 2006 IPRI data (by grade). 

Impact estimates are statistically adjusted to reflect the regression discontinuity design of the study. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools absent 
RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean amount of time spent per daily reading block in instruction in the five dimensions 
combined for first grade classrooms with Reading First was 59.23 minutes in spring 2005. The estimated mean amount of 
time without Reading First was 50.34 minutes. The impact of Reading First on the amount of time spent in instruction in 
the five dimensions combined was 8.89 minutes, which was statistically significant (p=.007). 

SOURCES: RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 











Exhibit 3.2: Change Over Time in Program Impact on Reading Comprehension and Instruction 



                                              Reading Comprehension     Reading Instruction
                                              (SAT 10 Scaled Score)     (min. in 5 Dimensions)

Grade 1
  Linear Year-to-Year Change                  2.82                      -2.59*
  SE                                          2.07                      1.22
  p-value                                     0.174                     0.034
  F-test for overall variation across years   0.808                     1.48
  p-value                                     0.446                     0.22

Grade 2
  Linear Year-to-Year Change                  0.53                      -2.88*
  SE                                          1.77                      1.25
  p-value                                     0.766                     0.021
  F-test for overall variation across years   0.072                     5.03*
  p-value                                     0.931                     0.025

Grade 3
  Linear Year-to-Year Change                  3.81*                     n.a.
  SE                                          1.74                      n.a.
  p-value                                     0.029                     n.a.
  F-test for overall variation across years   2.630                     n.a.
  p-value                                     0.072                     n.a.

All Available Grades a
  Linear Year-to-Year Change                  2.477*                    -2.36*
  SE                                          1.08                      0.87
  p-value                                     0.022                     0.007
  F-test for overall variation across years   2.712                     4.46*
  p-value                                     0.066                     0.035



NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one 
non-RF school could not be included in the analysis because test score data were not available. For grade 3 in 2007, one RF 
school could not be included in the analysis because test score data were not available. 

a For Reading Comprehension, grades 1-3 were included in the analysis. For Reading Instruction, only grades 1 and 2 were 
included in the analysis because instructional data were only available for these two grades. 

Impact estimates are statistically adjusted (e.g., take each school’s rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: For grade 1, the program impact on reading comprehension increases by 2.82 scaled score points 
per year between 2005 and 2007. This change was not statistically significant (p=.174). The program impact on 
instruction in the five dimensions of reading instruction decreases by 2.59 minutes per daily reading block per year. 
This change was statistically significant (p=.034). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR). RFIS Instructional 
Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 
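
As a rough plausibility check on Exhibit 3.2, each p-value for the linear year-to-year change can be approximated from the reported estimate and its standard error. The sketch below assumes a large-sample normal approximation; the report does not state the reference distribution or degrees of freedom used, and the function name is illustrative rather than taken from the study.

```python
import math


def two_tailed_p(estimate: float, se: float) -> float:
    """Two-tailed p-value for estimate/se under a standard normal approximation."""
    z = abs(estimate / se)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# (estimate, SE, reported p-value) copied from Exhibit 3.2.
rows = {
    "Grade 1, comprehension": (2.82, 2.07, 0.174),
    "Grade 1, instruction": (-2.59, 1.22, 0.034),
    "Grade 3, comprehension": (3.81, 1.74, 0.029),
    "All grades, instruction": (-2.36, 0.87, 0.007),
}

for label, (est, se, reported) in rows.items():
    approx = two_tailed_p(est, se)
    print(f"{label}: approximate p = {approx:.3f} (reported {reported})")
```

Small discrepancies from the reported values are expected, since the study's tests may rest on a t distribution with finite degrees of freedom and on unrounded estimates.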











Exhibit 3.3: Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007, and 
Pooled 



                                            Actual     Estimated                       Statistical
                                            Mean with  Mean without           Effect   Significance
                                            Reading    Reading                Size of  of Impact
Construct                                   First      First         Impact   Impact   (p-value)

Panel 1: All Sites
Reading Comprehension Scaled Score
  Grade 1
    Spring 2005                             541.2      538.9         2.2      0.05     (0.524)
    Spring 2006                             545.7      540.4         5.3      0.11     (0.152)
    Spring 2007                             545.3      537.8         7.5      0.15     (0.052)
    Pooled 3 years (2005, 2006, and 2007)   543.8      539.1         4.7      0.10     (0.083)
  Grade 2
    Spring 2005                             583.5      582.4         1.2      0.03     (0.654)
    Spring 2006                             585.3      583.7         1.6      0.04     (0.620)
    Spring 2007                             584.8      582.3         2.5      0.06     (0.415)
    Pooled 3 years (2005, 2006, and 2007)   584.4      582.8         1.7      0.04     (0.462)
  Grade 3
    Spring 2005                             607.4      609.9         -2.5     -0.06    (0.306)
    Spring 2006                             609.5      610.0         -0.5     -0.01    (0.860)
    Spring 2007                             610.6      605.1         5.5      0.14     (0.082)
    Pooled 3 years (2005, 2006, and 2007)   609.1      608.8         0.3      0.01     (0.887)

Panel 2: All Sites
Percent Reading At or Above Grade Level 1
  Grade 1
    Spring 2005                             43.8       41.6          2.2               (0.529)
    Spring 2006                             47.3       43.0          4.3               (0.217)
    Spring 2007                             47.5       40.3          7.3*              (0.047)
    Pooled 3 years (2005, 2006, and 2007)   46.0       41.8          4.2               (0.104)
  Grade 2
    Spring 2005                             38.0       38.0          0.0               (0.996)
    Spring 2006                             39.9       39.6          0.3               (0.926)
    Spring 2007                             39.0       34.1          4.9               (0.121)
    Pooled 3 years (2005, 2006, and 2007)   38.9       37.3          1.6               (0.504)
  Grade 3
    Spring 2005                             36.0       39.3          -3.3              (0.255)
    Spring 2006                             39.9       40.8          -0.9              (0.801)
    Spring 2007                             40.5       34.8          5.6               (0.101)
    Pooled 3 years (2005, 2006, and 2007)   38.7       38.8          -0.1              (0.973)



NOTES: 



The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) located in 13 
states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one non-RF school could not be 
included in the analysis because test score data were not available. For grade 3 in 2007, one RF school could not be included in the 
analysis because test score data were not available. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First schools 
pooled across the spring 2005 and 2006 SAT 10 test scores (by grade). 

Impact estimates are statistically adjusted (e.g., take each school’s rating, site-specific funding cut-point, and other covariates into 
account) to reflect the regression discontinuity design of the study. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the “Estimated 
Mean without Reading First” column represent the best estimates of what would have happened in RF schools absent RF funding and are 
calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

1 The “at or above grade level” variable is dichotomous, therefore effect sizes are not appropriate. 

EXHIBIT READS: The observed mean reading comprehension score for first-graders with Reading First was 541.2 scaled score 
points in spring 2005. The estimated mean without Reading First was 538.9 scaled score points. The impact of Reading First was 
2.2 scaled score points (or 0.05 standard deviations), which was not statistically significant (p=.524). The observed average percent 
of first-graders reading at or above grade level with Reading First was 43.8 percentage points in spring 2005. The estimated 
average percent without Reading First was 41.6 percentage points. The impact of Reading First on the percent of first grade 
students reading at or above grade level was 2.2 percentage points, which was not statistically significant (p=.529). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education agencies in those 
sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR) 
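
The notes above define two simple relationships: the estimated mean without Reading First is the actual mean minus the impact, and the effect size is the impact divided by a standard deviation. The following is a minimal sketch of that arithmetic using rounded values from the exhibit; the function names are hypothetical, and the implied standard deviation is only a back-calculation from rounded figures, not a value the report publishes.

```python
def counterfactual_mean(actual_mean: float, impact: float) -> float:
    """Estimated mean without Reading First: actual mean minus the impact estimate."""
    return actual_mean - impact


def implied_sd(impact: float, effect_size: float) -> float:
    """Standard deviation implied by effect_size = impact / sd (rounded inputs)."""
    return impact / effect_size


# Grade 3, spring 2007 (Panel 1): actual mean 610.6, impact 5.5, effect size 0.14.
print(round(counterfactual_mean(610.6, 5.5), 1))  # compare with the reported 605.1
print(round(implied_sd(5.5, 0.14), 1))            # rough SD of the comparison-group scores

# Grade 1, spring 2005 (Panel 1): the result can differ from the reported 538.9
# by about 0.1 because the published means and impacts are rounded independently.
print(round(counterfactual_mean(541.2, 2.2), 1))
```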










grade students in 2004-2005 were exposed to RF for only one year, while third graders in those same sites 
in 2006-2007 were exposed to RF for up to three years. The cross-sectional design of the study, in which 
all third grade students’ scores are pooled across years — regardless of number of years of exposure — does 
not account for differing amounts of exposure. As a result, the program’s observed effects may have been 
diluted, if in fact more years of exposure were related to greater impacts. 

To address this issue, a separate analysis was conducted (see Appendix F) to assess the effect of three 
years of observed program exposure. The sample for this analysis comprised all third-graders in spring 
2007 who were in a Reading First school during spring 2007 and 2005 (the program group) or in a non- 
Reading First school at both times (the comparison group). Given existing data, this is the best possible 
approximation to students with three years of program exposure. 28 

Program impacts for this subsample were then estimated for spring 2007 test scores. 

• These findings suggest an average impact of 4.3 scaled score points (not statistically 
significant), which represents an effect size of 0.11 standard deviations (Exhibit 3.4). This 
estimate is smaller than that of 5.5 scaled score points (not statistically significant), which 
represents an effect size of 0.14, for all third-graders in spring 2007. 

These impact estimates may be biased if Reading First caused a difference in the types of students who 
move from or stay at the same school. Because the study does not include pre-Reading First 
characteristics for students in the study sample, this question cannot be examined directly. As a result, the 
findings presented in this section should be interpreted with caution. Also, students who remain in schools 
with the same treatment status for three years likely differ along a number of important dimensions from 
students who do not, so the results of this analysis may have limited external validity. 

Variation in Impacts Across Sites 

This section explores whether the impact estimates presented in Chapter Two, which reflect averages 
across the 18 study sites, may be masking systematic differences in impacts among the sites. Study sites 
differ both in local conditions and in when they received their Reading First grants; the exploratory 
analyses presented here therefore examine a) site-by-site variation, and b) variation across early and late 
award sites. 



28 In the spring of 2005, the study tested students in all eligible classrooms in grades one through three in study schools. In 
subsequent waves of testing, the study tested students in a randomly selected subsample of classrooms in those study schools 
with four or more eligible classrooms per grade, on average, and continued to test all eligible students in eligible classrooms in 
those schools with three or fewer classrooms per grade level, on average. Because not all classrooms (and those classrooms’ 
students) were tested in 2006, it is not possible to determine how many third graders tested in 2007 had also been in study 
schools in both 2005 and 2006. Also, because not all third grade students were tested in all study schools in 2007, this sample 
does not encompass all students who remained in the same type of school (within the study sample) for three years. 








Exhibit 3.4: Estimated Impacts of Reading First on the Reading Comprehension of Students 
With Three Years of Exposure: Spring 2005-Spring 2007 



                                            Actual     Estimated                       Statistical
                                            Mean with  Mean without           Effect   Significance
                                            Reading    Reading                Size of  of Impact
                                            First      First         Impact   Impact   (p-value)

Students With Three Years of Exposure
Grade 3, Spring 2007
Reading Comprehension
    Scaled Score                            613.6      609.3         4.3      0.11     (0.223)
    Corresponding Grade Equivalent          3.5        3.3
    Corresponding Percentile                43         39



NOTES: 

The Three-Year Exposure sample includes 243 schools from 18 sites (17 school districts and 1 state) located in 13 states. 
123 schools are Reading First schools and 120 are non-Reading First schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading 
First Schools pooled across the spring 2005 and 2006 SAT 10 test scores (by grade). 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in 
the “Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF 
schools absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean reading comprehension score for third-graders with three years of 
exposure to Reading First was 613.6 scaled score points. The estimated mean without Reading First was 609.3 
scaled score points. The impact of Reading First was 4.3 scaled score points (or 0.11 standard deviations), which 
was not statistically significant (p=.223). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR) 



Site-by-Site Variation 

If variation in Reading First impacts across study sites exists, it could represent important differences in 
program effectiveness by site, which are masked by average impacts. This variation might help to identify 
conditions under which the program is more (or less) effective. Because the present study was designed 
primarily to estimate average program impacts, there are limits to its statistical power and methodological 
ability to support causal inferences about impact variation. Nevertheless, information from the study 
about impact variation can help to provide a broader context for assessing its findings about average 
impacts. 

Exhibits 3.5 and 3.6 graphically illustrate the impact estimates and 95 percent confidence intervals for 
instructional time in the five dimensions of reading and student test scores by site. This provides a visual 
representation of the variability in impacts as well as the uncertainty that exists about this variability. 








Exhibit 3.5: Fixed Effect Impact Estimates for Instruction, by Site, by Grade 




NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. 

Impact estimates are statistically adjusted to reflect the regression discontinuity design of the study. 

Boxes in exhibit represent mean impact estimates and lines represent 95 percent confidence intervals for each site. 



SOURCE: RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006 and spring 2007 













Exhibit 3.6: Fixed Effect Impact Estimates for Reading Comprehension, by Site, by Grade 




NOTES: 



The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) in 
13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one non-RF 
school could not be included in the analysis because test score data were not available. For grade 3 in 2007, one RF school 
could not be included in the analysis because test score data were not available. 

Impact estimates are statistically adjusted (e.g., take each school's rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

Boxes in exhibit represent mean impact estimates and lines represent 95 percent confidence intervals for each site. 

SOURCES: RFIS SAT 10 administration in the spring of 2007, as well as from state/district education agencies in those sites 
that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR) 












A formal test of whether this variation reflects true differences in impacts (that is, whether it is 
statistically significant at the conventional p<.05 level) or merely random error was conducted for each 
outcome by grade and then pooled across grades (Exhibit 3.7). 

• Estimated impacts on instructional time in the five dimensions per daily reading block ranged 
across sites and grades from reductions of more than 20 minutes to increases of more than 20 
minutes. Estimated impacts on reading comprehension scores ranged across sites and grades 
from reductions of nearly 30 scaled score points to increases of more than 35 scaled score 
points. However, formal tests indicated that this site-to-site variation was not statistically 
significant for either outcome (classroom reading instruction or student reading 
comprehension), either by grade or overall. The tests therefore do not support the hypothesis 
that there is systematic site-to-site variation. 



Exhibit 3.7: F-Test of Variation in Impacts Across Sites 




                                              Reading Instruction       Reading Comprehension
                                              (min. in 5 Dimensions)    (SAT 10 Scaled Score)

Grade 1
  F-stat                                      1.34                      1.424
  p-value                                     0.172                     0.114

Grade 2
  F-stat                                      1.31                      1.076
  p-value                                     0.190                     0.371

Grade 3
  F-stat                                      n/a                       0.903
  p-value                                     n/a                       0.570

All Available Grades a
  F-stat                                      1.47                      1.142
  p-value                                     0.108                     0.305



NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 
state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 
2006, one non-RF school could not be included in the analysis because test score data were not available. For grade 3 in 
2007, one RF school could not be included in the analysis because test score data were not available. 

a For Reading Comprehension, grades 1-3 were included in the analysis. For Reading Instruction, only grades 1 and 2 
were included in the analysis because instructional data were only available for these two grades. 

Impact estimates are statistically adjusted (e.g., take each school’s rating, site-specific funding cut-point, and 
other covariates into account) to reflect the regression discontinuity design of the study. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The F-statistic for the joint F-test of whether the program impact is the same across all sites 
for first grade reading instruction is 1.34, which was not statistically significant (p=.172). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR). RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006 and spring 2007 



Variation in Impacts Between Early and Late Award Sites 

The RFIS Interim Report presented analyses that examined differences between two groups of sites that 
were identified at the outset of the study based on the timing of their grant awards. Early award sites (10 
sites with 111 schools in the sample) received their initial Reading First grants between April and 
December 2003. Late award sites (8 sites with 137 schools in the sample) received their initial Reading 
First grants between January and August 2004. When the data collection period for the study ended (in 
June 2007), early award sites had been funded for an average of 46 months, and late award sites had been 
funded for an average of 37 months. 

The analyses conducted for this report update those from the Interim Report for the two main outcomes 
(reading instruction and reading comprehension) by incorporating data from the 2006-07 school year (see 
Appendix G). 29 For minutes of instruction in the five dimensions, Exhibit 3.8 indicates statistically 
significant impacts for late award sites, but not early award sites. For reading comprehension, as 
measured by scaled scores on the SAT 10, Exhibit 3.9 indicates no statistically significant impacts for 
early award sites and only one statistically significant impact (in Grade Two) for late award sites. 

There is no statistically significant difference between estimated impacts in late award versus early award 
sites in minutes of instruction in the five dimensions for either Grade One or Grade Two (Exhibit 3.10). 
The composite test (on an index that combines the three instructional outcomes and pools data from first 
and second grades) of differences between the two groups of sites was, however, statistically significant. 
The difference between estimated impacts in late award versus early award sites for average scaled scores 
in student reading comprehension was statistically significant for only Grade Two (Exhibit 3.10). The 
composite test (on an index that combines scaled scores and indicators of students’ at or above grade level 
performance and pools data across three grades) was not statistically significant. The inconsistent findings 
do not support the hypothesis that there is systematic variation across early and late award sites. 
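
The award-group differences discussed above can be reproduced, up to rounding, by subtracting each late award impact from the corresponding early award impact. The sketch below uses the rounded impacts published in Exhibits 3.8 and 3.9 and compares them with the differences in Exhibit 3.10; the function name and data layout are illustrative assumptions, and the calculation says nothing about statistical significance, which depends on standard errors the exhibits do not report.

```python
def impact_difference(early: float, late: float) -> float:
    """Difference in estimated impacts, early award sites minus late award sites."""
    return early - late


# (early impact, late impact, reported difference), rounded values from the exhibits.
impacts = {
    "Scaled score, grade 1": (2.9, 5.6, -2.8),
    "Scaled score, grade 2": (-4.4, 6.0, -10.4),
    "Scaled score, grade 3": (-3.9, 3.5, -7.4),
    "Minutes in five dimensions, grade 1": (2.02, 10.74, -8.72),
    "Minutes in five dimensions, grade 2": (5.55, 13.08, -7.53),
}

for label, (early, late, reported) in impacts.items():
    diff = impact_difference(early, late)
    # Differences of up to about 0.1 arise because the inputs are already rounded.
    print(f"{label}: computed {diff:.2f}, reported {reported}")
```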

Exploring the Relationship between Classroom Reading Instruction 
and Student Achievement 

The study provides a rigorous test of the extent to which the receipt of RF funding at the school level had 
an impact on instruction and reading achievement. However, another question of interest is whether the 
scientifically based reading instruction promoted by RF is related to student achievement, regardless of 
where it is implemented. Although the study design does not support a causal analysis of this question, 
the relationship between the study’s instructional data and the study’s achievement data (for grades one 
and two only) can be estimated using correlational techniques. 

This section, therefore, explores the following research question: What is the relationship between the 
degree of implementation of scientifically based reading instruction and student achievement? It uses 
hierarchical linear modeling to explore the observed correlations between instructional practices and 
student achievement in the RFIS sample of schools. These analyses are outside the causal research design 
(i.e., regression discontinuity design) described in Chapter Two, and can therefore provide evidence only 
about observed statistical associations between classroom instruction and student achievement in the 
study sample. 



29 This specific set of analyses was not conducted for the Student Engagement with Print measure. 








Exhibit 3.8: Estimated Impacts on Classroom Instruction: 2005, 2006, and 2007 (pooled), by Award 
Status 





                                            Actual     Estimated                       Statistical
                                            Mean with  Mean without           Effect   Significance
                                            Reading    Reading                Size of  of Impact
                                            First      First         Impact   Impact   (p-value)

Early Award Sites
Number of minutes of instruction in the five dimensions combined
    Grade 1                                 62.02      60.00         2.02     0.10     0.640
    Grade 2                                 63.04      57.49         5.55     0.26     0.223
Percentage of intervals in five dimensions with highly explicit instruction
    Grade 1                                 29.90      26.12         3.78     0.21     0.067
    Grade 2                                 31.34      31.38         -0.04    0.00     0.987
Percentage of intervals in five dimensions with High Quality Student Practice
    Grade 1                                 18.18      20.06         -1.88    -0.11    0.336
    Grade 2                                 17.66      14.14         3.53     0.20     0.073

Late Award Sites
Number of minutes of instruction in the five dimensions combined
    Grade 1                                 57.04      46.30         10.74*   0.52*    <0.001
    Grade 2                                 55.98      42.90         13.08*   0.62*    <0.001
Percentage of intervals in five dimensions with highly explicit instruction
    Grade 1                                 28.98      25.98         3.01     0.17     0.109
    Grade 2                                 30.65      25.25         5.40*    0.28*    0.004
Percentage of intervals in five dimensions with High Quality Student Practice
    Grade 1                                 18.63      15.70         2.93     0.17     0.073
    Grade 2                                 17.95      15.41         2.54     0.14     0.113



NOTES: 



The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. There are 8 late award sites, 
with 137 schools, and 10 early award sites, with 111 schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools pooled across the spring 2005, fall 2005, and spring 2006 IPRI data (by grade). 

Impact estimates are statistically adjusted to reflect the regression discontinuity design of the study. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools absent 
RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean amount of time spent in instruction in the five dimensions (phonemic awareness, 
phonics, vocabulary, fluency, and comprehension) in first grade classrooms with Reading First in early award sites was 
62.02 minutes. The estimated mean amount of time without Reading First was 60.00 minutes. The impact of Reading 
First on the amount of time spent in instruction in the five dimensions was 2.02 minutes (or 0.10 standard deviations), 
which was not statistically significant (p=.640). 

SOURCES: RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006 and spring 2007 










Exhibit 3.9: Estimated Impacts on Reading Comprehension: Spring 2005, 2006, and 2007 
(pooled), by Award Status 





                                            Actual     Estimated                       Statistical
                                            Mean with  Mean without           Effect   Significance
                                            Reading    Reading                Size of  of Impact
                                            First      First         Impact   Impact   (p-value)

Early Award Sites
Reading Comprehension
    Grade 1: Scaled Score                   546.6      543.8         2.9      0.06     (0.569)
      Corresponding Grade Equivalent        1.8        1.7
      Corresponding Percentile              47         44
    Grade 2: Scaled Score                   587.4      591.8         -4.4     -0.10    (0.287)
      Corresponding Grade Equivalent        2.6        2.7
      Corresponding Percentile              41         45
    Grade 3: Scaled Score                   613.1      617.0         -3.9     -0.10    (0.343)
      Corresponding Grade Equivalent        3.5        3.6
      Corresponding Percentile              43         46

Late Award Sites
Reading Comprehension
    Grade 1: Scaled Score                   541.6      536.0         5.6      0.11     (0.061)
      Corresponding Grade Equivalent        1.7        1.6
      Corresponding Percentile              43         39
    Grade 2: Scaled Score                   582.1      576.1         6.0*     0.14*    (0.021)
      Corresponding Grade Equivalent        2.4        2.3
      Corresponding Percentile              38         33
    Grade 3: Scaled Score                   606.0      602.4         3.5      0.09     (0.108)
      Corresponding Grade Equivalent        3.1        3.0
      Corresponding Percentile              36         34








NOTES: 



The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. Among them, there are 8 
late award sites, with 137 schools, and 10 early award sites, with 111 schools. For grade 2 in 2006, one non-RF school could 
not be included in the analysis because test score data were not available. For grade 3 in 2007, one RF school could not be 
included in the analysis because test score data were not available. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools pooled across the spring 2005 and 2006 SAT 10 test scores (by grade). 

Impact estimates are statistically adjusted (e.g., take each school's rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

Values in the “Actual Mean with Reading First” column are actual, unadjusted values for Reading First schools; values in the 
“Estimated Mean without Reading First” column represent the best estimates of what would have happened in RF schools 
absent RF funding and are calculated by subtracting the impact estimates from the RF schools’ actual mean values. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

EXHIBIT READS: The observed mean reading comprehension score for first-graders with Reading First in the late 
award sites was 541.6 scaled score points. The estimated mean without Reading First was 536.0 scaled score points. 
The impact of Reading First was 5.6 scaled score points (or 0.11 standard deviations), which was not statistically 
significant (p=.061). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR) 










Exhibit 3.10: Award Group Differences in Estimated Impacts on Reading Comprehension and 
Classroom Instruction: 2005, 2006, and 2007 (pooled) 





                                            Difference in                      Statistical
                                            Impact            Effect Size of   Significance of
                                            (Early - Late)    Difference       Differences (p-value)

Average Scaled Score
    Grade 1                                 -2.8              -0.06            (0.636)
    Grade 2                                 -10.4*            -0.25*           (0.032)
    Grade 3                                 -7.4              -0.19            (0.110)

Number of minutes spent in instruction in five dimensions combined
    Grade 1                                 -8.72             -0.42            (0.092)
    Grade 2                                 -7.53             -0.35            (0.155)

Percentage of observation intervals in five dimensions with
  Highly Explicit Instruction
    Grade 1                                 0.78              0.04             (0.779)
    Grade 2                                 -5.44             -0.28            (0.068)
  High Quality Student Practice
    Grade 1                                 -4.81             -0.29            (0.059)
    Grade 2                                 0.98              0.05             (0.696)


NOTES: 









The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) 
located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. There are 8 late award sites, 
with 137 schools, and 10 early award sites, with 111 schools. 

The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
schools pooled across the spring 2005 and 2006 data (by grade). 

Impact estimates are statistically adjusted (e.g., take each school's rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 

A composite test on an index that combines scaled scores and indicators of students’ at or above grade level performance and 
pools data across three grades of differences between early and late sites was not statistically significant (p=.082). 

A composite test on an index that combines the three instructional outcomes and pools data from first and second grades of 
differences between early and late sites was statistically significant (p=.037). 

EXHIBIT READS: The estimated difference in impact between early and late award sites in grade 1 was -2.8 scaled 
score points. The effect size of the difference was -0.06 standard deviations. The estimated difference was not statistically 
significant (p=.636). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007 as well as from state/district education agencies 
in those sites that already use the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR). RFIS Instructional Practice in 
Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 



Specifically, this section examines statistical associations between several aspects of reading instruction, 
each of which is developed from observational data collected using the study’s Instructional Practice in 
Reading Inventory, and student reading achievement, based on students’ test scores on the reading 
comprehension subtest of the Stanford Achievement Test, 10th Edition (SAT 10). The measures of reading 
instruction used in this analysis are the same as those selected to represent the degree of implementation 
of scientifically based reading instruction in Chapter Two of this report. They include: 












• average time spent per daily reading block in the five core dimensions of scientifically based 
reading instruction combined (referred to as “time in the five dimensions”), 30 

• average time spent per daily reading block in each of the five dimensions of scientifically 
based reading instruction (phonemic awareness, phonics, vocabulary, fluency and 
comprehension) separately, 

• the proportion of three-minute time intervals during reading instruction in the five dimensions 
of reading instruction that involve highly explicit instruction (referred to as “highly explicit 
instruction”), and 

• the proportion of three-minute time intervals during reading instruction in the five dimensions 
of reading instruction that involve high quality student practice (referred to as “high quality 
student practice”). 

This section also presents supplementary analyses that test whether there are other factors that might 
account for any observed relationship between the predictors outlined above and student reading 
comprehension. The study cannot possibly account for the complete set of alternative predictors in these 
models because it did not measure all the variables that are possibly related to both instruction and 
comprehension; nonetheless, three variables thought to be the most compelling are explored. 

All analyses are conducted using data from all schools included in the study: those with and without 
Reading First funding, without accounting for treatment group. Results are also presented separately by 
treatment status. Instructional variables from classroom observations and the SAT 10 test scores from all 
three years of data collection (2005, 2006, and 2007) are included in these analyses. The unit of 
observation is the classroom within a given school year. 31 In Year One, the classroom instruction 
measures are derived from classroom observations conducted in the spring of 2005; in Years Two and 
Three, they represent the average of the fall and spring observations. 
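
As an illustration only, the rule for building a classroom-year instruction measure can be sketched in Python. The function name, the data layout, and the handling of a missing observation are assumptions made for this example; they are not taken from the study's processing code.

```python
from statistics import mean
from typing import Optional


def classroom_year_measure(fall: Optional[float], spring: Optional[float]) -> float:
    """Combine the observations available for one classroom in one school year.

    Year One has only a spring observation (fall is None). In Years Two and
    Three the fall and spring values are averaged. Averaging whatever is
    available when one wave is missing is an assumption of this sketch.
    """
    values = [v for v in (fall, spring) if v is not None]
    if not values:
        raise ValueError("no observation available for this classroom-year")
    return mean(values)


# Hypothetical minutes in the five dimensions for one classroom.
year_one = classroom_year_measure(None, 50.0)   # spring 2005 only
year_two = classroom_year_measure(48.0, 56.0)   # fall 2005 and spring 2006
```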

Caveats 

The results described below should be interpreted with considerable caution. These analyses are outside 
the causal research design (i.e., regression discontinuity design) described in Chapter Two, and so do not 
provide evidence of a causal link between instructional practices and student reading comprehension. 

Estimation Model 

The analyses use a two-level hierarchical linear model to account for the repeated measures within 
classrooms, as well as indicator variables for schools to account for the nesting of classrooms within 
schools. More specifically, covariates in the models include: 

• site indicators, 



30 These five dimensions of reading instruction (phonemic awareness, phonics, vocabulary, fluency and comprehension) are 
outlined in the Reading First legislation and in the guidance provided to states about Reading First. 

31 A ‘classroom’ is defined as having the same teacher at the same grade level in the same school. Since some teachers moved to 
other schools, and some to other grades within the same school over the study’s three years of data collection, all classrooms 
are not necessarily represented in multiple years. 








• school indicators, 

• percentage of male students in the classroom, 

• classroom level average of student age at start of school year, 

• date of the post-test at the classroom level, 

• school-level pre-program reading performance measure. 32 

In order to account for possible modeling differences associated with the year of data collection, all of the 
covariates (except school indicators) are interacted with indicators for each data collection period. 33 Site 
indicators are interacted with the predictors and covariates to allow the estimation of separate regression 
coefficients in each site. Each regression coefficient is then weighted according to the number of RF 
schools in the site prior to averaging across sites. 
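
The weighting step described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name is hypothetical and the site coefficients and school counts are made up.

```python
def weighted_site_average(site_coefs, n_rf_schools):
    """Average site-specific regression coefficients, weighting each site
    by its number of Reading First (RF) schools."""
    if len(site_coefs) != len(n_rf_schools):
        raise ValueError("one weight is required per site coefficient")
    total_schools = sum(n_rf_schools)
    if total_schools <= 0:
        raise ValueError("the weights must sum to a positive number")
    weighted_sum = sum(b * n for b, n in zip(site_coefs, n_rf_schools))
    return weighted_sum / total_schools


# Made-up example with three sites (the study has 18).
site_coefs = [0.10, 0.20, 0.40]   # hypothetical site-level coefficients
n_rf_schools = [2, 6, 2]          # hypothetical RF school counts per site
overall = weighted_site_average(site_coefs, n_rf_schools)
```

With these made-up inputs, the site with six RF schools contributes 60 percent of the weight, so the overall estimate sits closer to its coefficient than a simple mean would.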

The multi-level model presented in (1) below estimates the degree to which variation in a particular 
predictor (PRE_tj) is associated with variation in the mean classroom-level reading comprehension test 
scores, controlling for the covariates listed above. For each grade, the model takes the following form: 

\[
\begin{aligned}
Y_{tjkm} ={}& \sum_{m}\sum_{t} \beta_{0mt}\, ST_{mk}\, YR_{t}
 + \sum_{m} \beta_{1m}\, ST_{mk}\, PRE_{tj}
 + \sum_{k} \beta_{2k}\, SC_{jk}
 + \sum_{m}\sum_{t} \beta_{3mt}\, ST_{mk}\, \bar{Y}_{-1km}\, YR_{t} \\
&+ \sum_{t} \gamma_{t}\, Z_{tjk}\, YR_{t}
 + \sum_{n}\sum_{t} \delta_{nt}\, X_{ntjkm}\, YR_{t}
 + \theta_{jk} + \varepsilon_{tjk}
\end{aligned}
\tag{1}
\]

where: 

$Y_{tjkm}$ = the average post-test score in year t, for classroom j, in school k, in site m, 

$ST_{mk}$ = one if school k is in site m and zero otherwise, m = 1 to 18, 

$PRE_{tj}$ = value of the predictor of interest in classroom j in year t, 

$SC_{jk}$ = the indicator variable for school k. In other words, it equals one if classroom j is in school 
k and zero otherwise, k = 1 to 248, 

$\bar{Y}_{-1km}$ = the mean baseline pretest for school k (standardized and centered by site), 

$YR_{t}$ = indicator for follow-up years; 2005, 2006 or 2007, 

$Z_{tjk}$ = a variable indicating when the post-test in year t was given for classroom j in school k (site- 
centered), 

$X_{ntjkm}$ = classroom average of the n-th demographic student characteristic in classroom j in school 
k, in site m, in year t, 

$\theta_{jk}$ and $\varepsilon_{tjk}$ = classroom-level random error term and the residual, respectively, assumed to be 
independently and identically distributed. 
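
To make the role of the school indicators concrete, the sketch below estimates a single slope after removing each school's mean from the predictor and the outcome, which is numerically equivalent to ordinary least squares with one indicator per school. It is a deliberately simplified illustration: it omits the year interactions, the covariates, the site-specific slopes, and the classroom-level random error term of the model above, and the function name and data are hypothetical.

```python
from collections import defaultdict


def within_school_slope(rows):
    """Estimate one slope of outcome on predictor, holding school constant.

    rows: iterable of (school_id, predictor, outcome) tuples, one per
    classroom-year. Demeaning within school absorbs every time-invariant
    school characteristic, just as a set of school indicators would.
    """
    by_school = defaultdict(list)
    for school, x, y in rows:
        by_school[school].append((x, y))

    sxy = 0.0  # within-school cross-product of predictor and outcome
    sxx = 0.0  # within-school sum of squares of the predictor
    for obs in by_school.values():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - x_bar) * (y - y_bar)
            sxx += (x - x_bar) ** 2
    if sxx == 0.0:
        raise ValueError("the predictor does not vary within any school")
    return sxy / sxx


# Made-up data: two schools with different baselines and a common slope.
rows = [
    ("A", 10.0, 541.5), ("A", 20.0, 543.0), ("A", 30.0, 544.5),
    ("B", 10.0, 551.5), ("B", 20.0, 553.0), ("B", 30.0, 554.5),
]
slope = within_school_slope(rows)
```

In the made-up data the two schools differ by 10 points at every level of the predictor; because that gap is absorbed by the school means, it does not affect the estimated slope.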



32 Different pre-program performance measures were constructed for early and late award sites. For the ten early award sites and 
one late award site (which had no fall 2004 test data due to a hurricane), performance on a state reading test (when available, 
an average of test scores from up to three pre-RF years) was used as a school level pretest measure. For late award sites except 
for the one without available fall 2004 data, the mean fall 2004 SAT 10 test scores for each school/grade were used as the 
pretest measure. 

33 This accounts for year-to-year variation in the levels of the outcome measure as well as the relationship between covariates and 
outcome measures. 








The average estimated value of $\beta_{1m}$ (m = 1, 2, ..., 18), weighted by the number of RF schools in each 
site, captures the overall relationship between student test scores and the predictor of interest. 34 An 
important distinction between the model described here and those employed for the main impact analyses 
is the use of school level indicators in place of the rating variable. These school level indicators were 
introduced to control for unobservable and time-invariant school characteristics that affected the outcome 
and the predictors. 



Findings 



Descriptive statistics and bivariate correlations between all of the predictors as well as the outcome are 
presented in Exhibits 3.11 and 3.12. Correlation coefficients between the outcome and predictors range 
from -0.06 to 0.27, and from -0.00 to 0.30 for grades one and two, respectively. 

The remainder of this section presents estimates of the relationship between student reading 
comprehension and the key measures of instruction listed above. First, the association between student 
reading comprehension and time spent on each of the five dimensions of reading instruction (phonemic 
awareness, phonics, comprehension, vocabulary, and fluency) was examined (Exhibit 3.13, Models I-V). 
A sixth model estimated the relationship between all five dimensions and comprehension; this model 
explores the relationship between comprehension and the time spent on a specific dimension controlling 
for the time spent on the other four dimensions. These analyses were conducted separately for grades one 
and two. Findings indicate that: 

• In grade one, when tested individually, time spent on comprehension and vocabulary were 
both significantly and positively related to student achievement. Specifically, a one-minute 
difference per daily reading block in the time spent on comprehension is associated with a 
0.15 scaled score point difference in student achievement, and a one-minute difference per 
daily reading block in the time spent on vocabulary is associated with a 0.22 point difference 
in student reading comprehension. 

• Time spent on phonics in grade one, however, was significantly and negatively related to 
student reading comprehension. In particular, a one-minute difference per daily reading block 
in the time spent on phonics was associated with a -0.10 point difference in student test 
scores. 

• In the model that tested the joint association between reading achievement and time spent on 
each dimension in grade one, only time spent on comprehension remained a significant 
predictor. 

• In grade two, time spent on phonics was significantly and negatively related to student 
reading comprehension. Similar to the finding in grade one, a one-minute difference per daily 
reading block in the time spent on phonics was associated with a -0.15 point difference in 
student test scores. 

• Time spent on comprehension was also significantly related to student reading 
comprehension in grade two, such that a one-minute difference per daily reading block in the 
time spent on comprehension was associated with a 0.12 point difference in student reading 
comprehension. 



34 Note that models that jointly tested multiple predictors were also estimated. In such cases, the overall relational coefficient for 
each predictor was calculated in a similar manner. 








Exhibit 3.11: Descriptive Statistics 

                                                        Mean   Std Dev         N       Min       Max
Panel A: GRADE 1
SAT10 Test Score                                       544.7      23.2      2199     423.0     629.7
Minutes spent on...
  Phonemic Awareness                                    1.64      2.35      2199      0.00     22.59
  Phonics                                              19.21     11.25      2199      0.00     63.99
  Comprehension                                        21.95     11.73      2199      0.00     72.26
  Vocabulary                                            7.17      5.18      2199      0.00     31.82
  Fluency                                               4.22      5.18      2199      0.00     44.74
  Five dimensions combined                             54.19     18.36      2199      0.00    132.15
Percentage of Intervals in the five dimensions
with highly explicit instruction                       28.48     13.88      2199      0.00     78.46
Percentage of Intervals in the five dimensions
with high quality student practice                     17.89     12.18      2199      0.00     81.53
Observation length                                    108.57     26.71      2199     30.00    237.75
GATS score                                              4.40      0.58      1403      1.98      5.00
Percentage of students engaged with print              46.26     22.49      1399      0.00    100.00
Pretest (Z-scored)                                      0.01      1.02      2199     -4.47      2.71

Panel B: GRADE 2
SAT10 Test Score                                       586.1      19.0      2133     515.7     664.3
Minutes spent on...
  Phonemic Awareness                                    0.39      0.99      2133      0.00     15.27
  Phonics                                              11.41      9.04      2133      0.00     59.69
  Comprehension                                        26.70     13.24      2133      0.00     91.20
  Vocabulary                                           10.32      6.67      2133      0.00     57.83
  Fluency                                               3.57      4.65      2133      0.00     43.77
  Five dimensions combined                             52.37     18.28      2133      5.01    123.84
Percentage of Intervals in the five dimensions
with highly explicit instruction                       29.99     14.30      2133      0.00     92.15
Percentage of Intervals in the five dimensions
with high quality student practice                     17.38     11.99      2133      0.00     72.31
Observation length                                    106.15     26.43      2133     36.75    210.00
GATS score                                              4.41      0.59      1371      1.40      5.00
Percentage of students engaged with print              50.88     22.40      1363      0.00    100.00
Pretest (Z-scored)                                      0.01      1.02      2133     -3.92      2.89

NOTES: 

The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 
state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 
2006, one non-RF school could not be included in the analysis because test score data were not available. 

EXHIBIT READS: The mean grade one SAT 10 score was 544.7, with a standard deviation of 23.2 across 2,199 
observations. The minimum score was 423.0, and the maximum score was 629.7. 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR): RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007; RFIS 
Global Appraisal of Teaching Strategies, fall 2005, spring 2006, fall 2006, and spring 2007; RFIS Student Time-on-Task 
and Engagement with Print, fall 2005, spring 2006, fall 2006, and spring 2007 










Exhibit 3.12: Bivariate Correlation Coefficients between Test Scores and Predictors 

Panel A: GRADE 1 

Variables (column numbers in the matrix refer to the same variables as the row numbers): 
 (1) SAT10 Test Score 
 (2) Minutes spent on Phonemic Awareness 
 (3) Minutes spent on Phonics 
 (4) Minutes spent on Comprehension 
 (5) Minutes spent on Vocabulary 
 (6) Minutes spent on Fluency Building 
 (7) Minutes spent on the Five Dimensions Combined 
 (8) Percentage of Intervals in the five dimensions with highly explicit instruction 
 (9) Percentage of Intervals in the five dimensions with high quality student practice 
(10) Observation length 
(11) GATS score 
(12) Percentage of students engaged with print 

           (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)
 (1)
 (2)    -0.053
 (3)    -0.063   0.177
 (4)     0.150  -0.066  -0.132
 (5)     0.091   0.065   0.079   0.165
 (6)     0.068  -0.049   0.062   0.052   0.001
 (7)     0.095   0.199   0.590   0.610   0.444   0.347
 (8)     0.074   0.144   0.164   0.009   0.370  -0.092   0.202
 (9)     0.055   0.197   0.136  -0.015   0.045   0.207   0.170   0.183
(10)     0.017   0.088   0.314   0.335   0.262   0.222   0.554  -0.030  -0.004
(11)     0.269   0.060   0.160   0.092   0.165   0.073   0.228   0.183   0.195  -0.005
(12)     0.174  -0.077   0.065  -0.022  -0.005   0.139   0.047   0.066   0.109  -0.095   0.165






Exhibit 3.12: Bivariate Correlation Coefficients between Test Scores and Predictors (continued) 

Panel B: GRADE 2 

Variables (column numbers in the matrix refer to the same variables as the row numbers): 
 (1) SAT10 Test Score 
 (2) Minutes spent on Phonemic Awareness 
 (3) Minutes spent on Phonics 
 (4) Minutes spent on Comprehension 
 (5) Minutes spent on Vocabulary 
 (6) Minutes spent on Fluency Building 
 (7) Minutes spent on the Five Dimensions Combined 
 (8) Percentage of Intervals in the five dimensions with highly explicit instruction 
 (9) Percentage of Intervals in the five dimensions with high quality student practice 
(10) Observation length 
(11) GATS score 
(12) Percentage of students engaged with print 

           (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)
 (1)
 (2)    -0.003
 (3)    -0.129   0.210
 (4)     0.093  -0.078  -0.136
 (5)     0.027   0.008   0.073   0.138
 (6)    -0.030   0.015   0.100   0.010  -0.033
 (7)     0.006   0.108   0.459   0.705   0.493   0.300
 (8)     0.123   0.102   0.079   0.072   0.369  -0.085   0.210
 (9)     0.059   0.123   0.155   0.072   0.075   0.152   0.201   0.232
(10)    -0.091   0.033   0.288   0.370   0.246   0.177   0.547  -0.014  -0.041
(11)     0.303   0.015   0.072   0.199   0.136   0.096   0.247   0.220   0.220  -0.038
(12)     0.173  -0.030   0.027   0.005  -0.057   0.069   0.010   0.069   0.011  -0.091   0.190










Exhibit 3.13: Regression Coefficients for the Relationship between Classroom Reading 
Instruction and Reading Comprehension 

                               I        II       III        IV         V        VI
Panel A: GRADE 1
Minutes in...
  Phonemic Awareness      -0.220         -         -         -         -    -0.102
                         (0.316)                                           (0.656)
  Phonics                      -   -0.103*         -         -         -    -0.072
                                   (0.024)                                 (0.135)
  Comprehension                -         -    0.148*         -         -    0.131*
                                            (<0.001)                       (0.005)
  Vocabulary                   -         -         -    0.219*         -     0.175
                                                       (0.017)             (0.062)
  Fluency                      -         -         -         -     0.146     0.148
                                                                 (0.206)   (0.212)

Panel B: GRADE 2
Minutes in...
  Phonemic Awareness      -0.128         -         -         -         -     0.158
                         (0.769)                                           (0.729)
  Phonics                      -   -0.150*         -         -         -   -0.138*
                                  (<0.001)                                 (0.003)
  Comprehension                -         -    0.115*         -         -    0.099*
                                            (<0.001)                       (0.002)
  Vocabulary                   -         -         -     0.086         -     0.084
                                                       (0.139)             (0.159)
  Fluency                      -         -         -         -     0.004     0.074
                                                                 (0.966)   (0.443)

NOTES: 

Sample sizes for grade 1 and 2 analyses are 2,199 and 2,133 classrooms, respectively. The complete Reading First 
Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) located in 13 states. 125 
schools are Reading First schools and 123 are non-Reading First schools. For grade 2 in 2006, one non-RF school could 
not be included in the analysis because test score data were not available. 

A two-tailed test of significance was used, and where applicable, statistically significant findings at the p<.05 level are 
indicated by *. P-values are in parentheses. 

EXHIBIT READS: For grade 1, the regression coefficient between minutes spent teaching phonemic awareness 
and student achievement is -.22, which means that a one-minute difference in the amount of time spent teaching 
phonemic awareness per daily reading block is associated with a -0.22 point difference in student test scores. This 
association is not statistically significant (p=0.316). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR): RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 



• These two predictors remained significant in the specification that tested all five predictors 
jointly in grade two. 
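
To give a sense of scale for the per-minute coefficients in the findings above, the sketch below re-expresses one of them for a one-standard-deviation difference in the predictor, using the grade one comprehension coefficient from Exhibit 3.13 (0.148) and the grade one standard deviations from Exhibit 3.11 (11.73 minutes; 23.2 scaled score points). This conversion is not reported in the study; it is illustrative arithmetic only, it carries no causal interpretation, and the function name is hypothetical.

```python
def sd_scaled_association(coef_per_minute, predictor_sd, outcome_sd):
    """Re-express a per-minute coefficient for a one-SD predictor difference.

    Returns the associated difference in scaled score points and the same
    difference expressed in outcome standard deviation units.
    """
    points = coef_per_minute * predictor_sd
    return points, points / outcome_sd


# Grade one, minutes in comprehension (Exhibit 3.13, Model III; Exhibit 3.11).
points, sd_units = sd_scaled_association(0.148, 11.73, 23.2)
```

Under these inputs, a one-standard-deviation difference in comprehension minutes corresponds to fewer than two scaled score points, or less than one-tenth of a standard deviation of the classroom-level test score.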

These analyses were also run separately by treatment status to see whether the relationship between 
instruction and comprehension differed between the two groups of schools. As shown in Exhibits 3.14 
and 3.15, except in phonics in grade one (p=.035), there are no statistically significant differences in the 
estimates for the treatment and comparison groups in either grade. However, note that in Exhibit 3.14, 
Model II, in which phonics is included on its own, the difference between the estimated coefficients for 
the treatment and the comparison groups is not statistically significant. Overall, therefore, the results 












suggest that the estimated relationships between student reading comprehension and key measures of 
reading instruction do not differ across the treatment and comparison groups. 
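
The report gives p-values, not standard errors, for the treatment and comparison estimates. As a rough plausibility check on the comparisons in Panel C of Exhibits 3.14 and 3.15, the sketch below backs an approximate standard error out of each reported coefficient and p-value and then tests the difference between the two estimates. It assumes a normal reference distribution and independent estimates, neither of which is confirmed by the report, so it will not reproduce the published t-tests exactly; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used as an approximation


def se_from_p(coef, p_two_sided):
    """Approximate a standard error from a coefficient and its two-sided p-value."""
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must be strictly between 0 and 1")
    z = _STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)
    return abs(coef) / z


def difference_p_value(coef_a, p_a, coef_b, p_b):
    """Two-sided p-value for the difference between two independent estimates."""
    se_a = se_from_p(coef_a, p_a)
    se_b = se_from_p(coef_b, p_b)
    z = (coef_a - coef_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


# Grade one phonics, Model II (Exhibit 3.14): treatment vs. comparison group.
p_diff = difference_p_value(-0.182, 0.006, 0.003, 0.965)
```

For this pair of estimates the approximation falls near .05, in the neighborhood of the .057 reported in Exhibit 3.14, Panel C. Because the inputs are rounded to three decimals, the check is sensitive to rounding when a coefficient is close to zero.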



Exhibit 3.14: Regression Coefficients Between Classroom Reading Instruction and 
Reading Comprehension by Treatment Status—Grade 1 

                               I        II       III        IV         V        VI
Panel A: Treatment Group
Minutes in...
  Phonemic Awareness      -0.401         -         -         -         -    -0.185
                         (0.176)                                           (0.555)
  Phonics                      -   -0.182*         -         -         -   -0.160*
                                   (0.006)                                 (0.027)
  Comprehension                -         -    0.143*         -         -     0.076
                                             (0.039)                       (0.308)
  Vocabulary                   -         -         -     0.226         -     0.186
                                                       (0.076)             (0.168)
  Fluency                      -         -         -         -     0.100     0.171
                                                                 (0.546)   (0.331)

Panel B: Comparison Group
Minutes in...
  Phonemic Awareness       0.121         -         -         -         -     0.237
                         (0.771)                                           (0.590)
  Phonics                      -     0.003         -         -         -     0.064
                                   (0.965)                                 (0.409)
  Comprehension                -         -    0.143*         -         -    0.169*
                                             (0.028)                       (0.018)
  Vocabulary                   -         -         -     0.051         -    -0.012
                                                       (0.732)             (0.940)
  Fluency                      -         -         -         -     0.279     0.332
                                                                 (0.152)   (0.128)

Panel C: P-values from t-tests comparing treatment and comparison estimates
Minutes in...
  Phonemic Awareness       0.307         -         -         -         -     0.434
  Phonics                      -     0.057         -         -         -    0.035*
  Comprehension                -         -     1.000         -         -     0.367
  Vocabulary                   -         -         -     0.372         -     0.339
  Fluency                      -         -         -         -     0.484     0.565

NOTES: 

Sample size for grade 1 analysis is 2,199 classrooms. The complete Reading First Impact Study sample includes 248 
schools from 18 sites (17 school districts and 1 state) located in 13 states. 125 schools are Reading First schools and 
123 are non-Reading First schools. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. In 
panels A and B, p-values are in parentheses. 

EXHIBIT READS: For the treatment group in grade 1, the regression coefficient between minutes in phonemic 
awareness and student achievement is -.401, which means that a one-minute difference in the time spent 
teaching phonemic awareness per daily reading block is associated with a -0.40 point difference in student test 
scores. This association is not statistically significant (p=.176). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR): RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 












Exhibit 3.15: Regression Coefficients Between Classroom Reading Instruction and 
Reading Comprehension by Treatment Status—Grade 2 

                               I        II       III        IV         V        VI
Panel A: Treatment Group
Minutes in...
  Phonemic Awareness       0.025         -         -         -         -     0.541
                         (0.970)                                           (0.451)
  Phonics                      -    -0.073         -         -         -    -0.027
                                   (0.270)                                 (0.709)
  Comprehension                -         -    0.102*         -         -     0.097
                                             (0.031)                       (0.066)
  Vocabulary                   -         -         -     0.078         -     0.056
                                                       (0.347)             (0.528)
  Fluency                      -         -         -         -    -0.067    -0.004
                                                                 (0.626)   (0.978)

Panel B: Comparison Group
Minutes in...
  Phonemic Awareness      -0.748         -         -         -         -    -0.633
                         (0.523)                                           (0.626)
  Phonics                      -    -0.063         -         -         -    -0.062
                                   (0.423)                                 (0.466)
  Comprehension                -         -    0.147*         -         -    0.123*
                                             (0.001)                       (0.013)
  Vocabulary                   -         -         -     0.126         -     0.112
                                                       (0.161)             (0.228)
  Fluency                      -         -         -         -     0.229     0.329
                                                                 (0.160)   (0.056)

Panel C: P-values from t-tests comparing treatment and comparison estimates
Minutes in...
  Phonemic Awareness       0.568         -         -         -         -     0.418
  Phonics                      -     0.922         -         -         -     0.754
  Comprehension                -         -     0.496         -         -     0.718
  Vocabulary                   -         -         -     0.695         -     0.663
  Fluency                      -         -         -         -     0.166     0.145

NOTES: 

Sample size for grade 2 analysis is 2,133 classrooms. The complete Reading First Impact Study sample includes 248 
schools from 18 sites (17 school districts and 1 state) located in 13 states. 125 schools are Reading First schools and 
123 are non-Reading First schools. For grade 2 in 2006, one non-RF school could not be included in the analysis 
because test score data were not available. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. In 
panels A and B, p-values are in parentheses. 

EXHIBIT READS: For the treatment group in grade 2, the regression coefficient between minutes in phonemic 
awareness and student achievement is .025, which means that a one-minute difference in the time spent teaching 
phonemic awareness per daily reading block is associated with a 0.03 point difference in student test scores. This 
association is not statistically significant (p=.970). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR); RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 











Next, the associations between student reading comprehension and three more broadly defined measures 
of reading instruction were examined (Exhibit 3.16). These measures are total time spent on the five 
dimensions, percentage of classroom observation intervals in which teachers used highly explicit 
instructional strategies associated with the five dimensions, and percentage of intervals in which students 
were provided with high quality reading practice. First, three models were fit using each measure as a 
predictor of student reading comprehension separately. Then, all three measures were included together in 
a fourth model. 



Exhibit 3.16: Regression Coefficients Between Broadly Defined Measures of Classroom 
Instruction and Reading Comprehension 

                                                         I        II       III        IV
Panel A: GRADE 1
Minutes in the five dimensions                      0.073*         -         -    0.073*
                                                   (0.014)                       (0.019)
Percentage of Intervals in the five dimensions           -    -0.023         -    -0.039
with highly explicit instruction                             (0.479)             (0.247)
Percentage of Intervals in the five dimensions           -         -     0.040     0.038
with high quality student practice                                     (0.270)   (0.311)

Panel B: GRADE 2
Minutes in the five dimensions                      0.051*         -         -    0.058*
                                                   (0.034)                       (0.023)
Percentage of Intervals in the five dimensions           -     0.007         -    -0.004
with highly explicit instruction                             (0.778)             (0.886)
Percentage of Intervals in the five dimensions           -         -     0.022     0.008
with high quality student practice                                     (0.450)   (0.790)

NOTES: 

These analyses use available data from all years (Grade 1 and Grade 2 analysis sample sizes are 2,199 and 2,133 
classrooms, respectively). The complete Reading First Impact Study sample includes 248 schools from 18 sites (17 
school districts and 1 state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First 
schools. For grade 2 in 2006, one non-RF school could not be included in the analysis because test score data were not 
available. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 
P-values are in parentheses. 

EXHIBIT READS: For grade 1, the regression coefficient between minutes spent teaching the five dimensions of 
reading and student achievement is .073, which means that a one-minute difference in the amount of time spent 
teaching the five dimensions of reading per daily reading block is associated with a 0.07 point difference in 
student test scores. This association is statistically significant (p=.014). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR); RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 



When tested individually, total time spent on the five dimensions of reading was significantly and 
positively related to reading achievement in both grades. As Model I in Exhibit 3.16, Panel A shows, a 
one-minute difference in the total time spent on the five dimensions per daily reading block was associated 
with a 0.07 point difference in student test scores in grade one. In grade two, a one-minute difference in 
time spent teaching the five dimensions per daily reading block was associated with a 0.05 point 
difference in student test scores. When tested jointly with the other two main predictors of interest, the 
same relationship was observed between total time spent on the five dimensions of reading and student 
reading comprehension in both grades (Model IV). Results of these analyses run separately by treatment 













status indicate that there are no statistically significant differences across the two groups of schools (see 
Exhibit 3.17). 

The previous analysis suggests that time spent in the five dimensions of reading is positively related to 
levels of student reading comprehension. However, it is quite possible that some other variable(s), not 
included in these models, may actually account for the observed relationship. For example, teachers who 
spend more time on the five dimensions of reading may simply devote more time to reading, have more 
organized classrooms, or have students who spend more classroom time engaged with print material. 
Therefore, in addition to the primary predictors, three other measures — length of the reading block, a 
global measure of instructional quality (instructional organization and order), and percentage of students 
engaged with print — were also tested as alternative predictors of student reading comprehension. 

Because two of the alternative predictors (instructional organization and order and percentage of students 
engaged with print) were not collected in the first study year, the model that jointly tested the three main 
predictors was re-estimated on two subsamples of 1,399 Grade One and 1,363 Grade Two classrooms for 
which all six predictors (three main and three alternative) were available (Exhibit 3.18, Model I). All 
further analyses were conducted using these subsamples. 

Since the subsamples used to estimate Model I in Exhibit 3.18 are substantially different from (and only about 
two-thirds as large as) the full samples used to estimate Model IV in Exhibit 3.16, the results of analyses 
using the subsamples should be interpreted with caution. We cannot know whether we would have 
observed the same pattern of results if we had been able to use the full sample for these analyses. For 
example, Exhibits 3.16 and 3.18 indicate that even before adding the alternative predictors to the model, 
the relationships are substantively different when estimating with the subsample rather than the full 
sample, such that the relationship between minutes spent in the five dimensions of reading and reading 
comprehension is no longer statistically significant in either first or second grade in the subsample. In 
addition, in first grade, the relationship between highly explicit instruction and reading comprehension is 
negative and statistically significant and the relationship between high quality student practice and reading 
comprehension is positive and statistically significant in the subsample, when neither was statistically 
significant in the full sample. 
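
The "about two-thirds" figure can be checked directly from the classroom counts given in the text (1,399 of 2,199 grade one classrooms and 1,363 of 2,133 grade two classrooms). The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Classroom counts reported in the text for the full samples and for the
# subsamples in which all six predictors were available.
full_sample = {"grade 1": 2199, "grade 2": 2133}
subsample = {"grade 1": 1399, "grade 2": 1363}

# Share of the full sample retained in each grade's subsample.
shares = {grade: subsample[grade] / full_sample[grade] for grade in full_sample}
```

Both shares come out just under 0.64, slightly below two-thirds.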

The alternative hypotheses were tested by estimating a single model that included all six primary and 
secondary predictors (Exhibit 3.18, Model II). The exhibit presents separate estimates from these analyses 
for grades one and two. 

• In grade one, when jointly tested using the classrooms for which all six predictors were 
available, one of the primary predictors (the measure accounting for the presence of highly 
explicit instruction in the five dimensions) was significantly linked to achievement. More 
specifically, a one-percentage point difference in the share of intervals that included highly 
explicit instruction in the five dimensions was related to a -0.14 point difference in student 
test scores. 

• None of the three primary predictors were statistically significantly related to student test 
scores in grade two, when the model was estimated with all six predictors. 






Exhibit 3.17: Regression Coefficients Between Broadly Defined Measures of Classroom Instruction and Reading Comprehension by 
Grade and Treatment Status 

                                                                GRADE 1                                 GRADE 2
                                                         I        II       III        IV         V        VI       VII      VIII
Panel A: Treatment Group
Minutes in the five dimensions                       0.042         -         -     0.032    0.075*         -         -    0.093*
                                                   (0.318)                       (0.488)   (0.043)                       (0.018)
Percentage of Intervals in the five dimensions           -    -0.015         -    -0.016         -    -0.005         -    -0.030
with highly explicit instruction                             (0.757)             (0.755)             (0.903)             (0.485)
Percentage of Intervals in the five dimensions           -         -     0.053     0.071         -         -     0.040     0.011
with high quality student practice                                     (0.321)   (0.209)                       (0.366)   (0.821)

Panel B: Comparison Group
Minutes in the five dimensions                     0.0124*         -         -    0.136*     0.062         -         -     0.064
                                                   (0.009)                       (0.006)   (0.081)                       (0.098)
Percentage of Intervals in the five dimensions           -    -0.036         -    -0.052         -     0.033         -     0.034
with highly explicit instruction                             (0.456)             (0.296)             (0.367)             (0.406)
Percentage of Intervals in the five dimensions           -         -     0.001    -0.011         -         -    -0.011    -0.040
with high quality student practice                                     (0.993)   (0.984)                       (0.796)   (0.405)

Panel C: P-values from t-tests comparing treatment and comparison estimates
Minutes in the five dimensions                       0.194         -         -     0.121     0.799         -         -     0.598
Percentage of Intervals in the five dimensions
with highly explicit instruction                         -     0.760         -     0.619         -     0.480         -     0.273
Percentage of Intervals in the five dimensions
with high quality student practice                       -         -     0.494     0.365         -         -     0.415     0.448



NOTES: 



These analyses use available data from all years (Grade 1 and Grade 2 analysis sample sizes are 2,199 and 2,133 classrooms, respectively). The complete Reading First Impact 
Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First 
schools. For grade 2 in 2006, one non-RF school could not be included in the analysis because test score data were not available. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. P-values are in parentheses. 

EXHIBIT READS: For the treatment group in grade 1, the regression coefficient between minutes spent teaching the five dimensions of reading and student 
achievement is .042, which means that a one-minute difference in the amount of time spent teaching the five dimensions of reading per daily reading block is associated 
with a 0.04 point difference in student test scores. This association is not statistically significant (p=.318). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007, as well as from state/district education agencies in those sites that already used the SAT 10 for 
their standardized testing (i.e., FL, KS, MD, OR): RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 








Exhibit 3.18: Regression Coefficients Between All Predictors and Reading 
Comprehension 

                                                      I           II

Panel A: GRADE 1
Minutes in the five dimensions                        0.089       0.078
                                                      (0.056)     (0.171)
Percentage of Intervals in the five dimensions with   -0.126*     -0.136*
  highly explicit instruction                         (0.019)     (0.013)
Percentage of Intervals in the five dimensions with   0.128*      0.118
  high quality student practice                       (0.034)     (0.059)
Observation length                                    -           -0.022
                                                                  (0.645)
GATS score                                            -           3.702*
                                                                  (0.002)
Percentage of students engaged with print             -           0.036
                                                                  (0.194)

Panel B: GRADE 2
Minutes in the five dimensions                        0.042       -0.010
                                                      (0.273)     (0.825)
Percentage of Intervals in the five dimensions with   -0.015      -0.039
  highly explicit instruction                         (0.715)     (0.376)
Percentage of Intervals in the five dimensions with   0.006       -0.016
  high quality student practice                       (0.909)     (0.761)
Observation length                                    -           0.018
                                                                  (0.638)
GATS score                                            -           5.407*
                                                                  (<0.001)
Percentage of students engaged with print             -           0.002
                                                                  (0.939)



NOTES: 



These analyses use the sample of classrooms for which all predictors are available (Grade 1 and Grade 2 analysis 
sample sizes are 1,399 and 1,363 classrooms, respectively). The complete Reading First Impact Study sample 
includes 248 schools from 18 sites (17 school districts and 1 state) located in 13 states. 125 schools are Reading First 
schools and 123 are non-Reading First schools. For grade 2 in 2006, one non-RF school could not be included in the 
analysis because test score data were not available. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. 
P-values are in parentheses. 

EXHIBIT READS: Controlling for the other variables in the model, the regression coefficient between minutes 
spent teaching the five dimensions of reading and student achievement in grade 1 is .089, which means that a 
one-minute difference in the time spent teaching the five dimensions of reading per daily reading block is 
associated with a 0.09 point difference in student test scores. This association is not statistically 
significant (p=.056). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006 and 2007, as well as from state/district education 
agencies in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR): RFIS 
Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007 









• Among the secondary predictors, the measure of instructional organization and order was 
positively and statistically significantly related to student test scores in both first 
and second grade. A one point difference in the measure of instructional organization and 
order (measured on a five point scale) was associated with a 3.7 point difference in test scores 
in first grade and a 5.4 point difference in second grade. 

These analyses were conducted separately by treatment status to determine whether the relationship 
between instruction and comprehension differed between the two groups of schools. Results are shown in 
Exhibit 3.19. Again, there is no pattern of statistically significant differences across the two groups of 
schools. 

Summary 

In sum, the correlational analyses described above indicate a positive association between time spent on 
the five essential components of reading instruction promoted by the program and reading comprehension 
as measured by the SAT 10, but these findings are sensitive to both model specification and the sample 
used to estimate the relationship. In addition, these analyses do not support causal inferences. 








Exhibit 3.19: Regression Coefficients Between All Predictors and Reading Comprehension 
by Treatment Status 

                                                      Grade 1             Grade 2
                                                      I         II        III       IV

Panel A: Treatment Group
Minutes in the five dimensions                        0.081     0.030     0.071     0.018
                                                      (0.330)   (0.772)   (0.274)   (0.822)
Percentage of Intervals in the five dimensions with   -0.056    -0.035    -0.058    -0.125
  highly explicit instruction                         (0.546)   (0.743)   (0.414)   (0.147)
Percentage of Intervals in the five dimensions with   0.052     0.023     0.120     0.076
  high quality student practice                       (0.595)   (0.827)   (0.151)   (0.395)
Observation length                                    -         -0.041    -         0.014
                                                                (0.690)             (0.867)
GATS score                                            -         1.846     -         6.854*
                                                                (0.343)             (<0.001)
Percentage of students engaged with print             -         0.078     -         0.005
                                                                (0.100)             (0.903)

Panel B: Comparison Group
Minutes in the five dimensions                        0.151*    0.141     0.045     -0.062
                                                      (0.039)   (0.211)   (0.404)   (0.376)
Percentage of Intervals in the five dimensions with   -0.203*   -0.215*   0.015     0.013
  highly explicit instruction                         (0.012)   (0.019)   (0.781)   (0.830)
Percentage of Intervals in the five dimensions with   0.144     0.164     -0.148    -0.138
  high quality student practice                       (0.135)   (0.108)   (0.053)   (0.406)
Observation length                                    -         -0.113    -         -0.011
                                                                (0.157)             (0.839)
GATS score                                            -         6.301*    -         4.813*
                                                                (0.006)             (0.012)
Percentage of students engaged with print             -         -0.016    -         -0.046
                                                                (0.745)             (0.211)

Panel C: P-values from t-tests comparing treatment and comparison estimates
Minutes in the five dimensions                        0.523     0.462     0.764     0.450
Percentage of Intervals in the five dimensions with   0.228     0.203     0.414     0.186
  highly explicit instruction
Percentage of Intervals in the five dimensions with   0.503     0.334     0.018*    0.254
  high quality student practice
Observation length                                    -         0.573     -         0.800
GATS score                                            -         0.135     -         0.417
Percentage of students engaged with print             -         0.165     -         0.336



NOTES: 

These analyses use the sample of classrooms for which all predictors are available (Grade 1 and Grade 2 analysis sample 
sizes are 1,399 and 1,363, respectively). The complete Reading First Impact Study sample includes 248 schools from 18 
sites (17 school districts and 1 state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading 
First schools. For grade 2 in 2006, one non-RF school could not be included in the analysis because test score data were 
not available. 

A two-tailed test of significance was used; statistically significant findings at the p<.05 level are indicated by *. In panels 
A and B, p-values are in parentheses. 

EXHIBIT READS: Controlling for the other variables in the model, the regression coefficient between minutes 
spent teaching the five dimensions of reading and student achievement for the treatment group in grade 1 is 
.081, which means that a one-minute difference in the time spent teaching the five dimensions of reading per 
daily reading block is associated with a 0.08 point difference in student test scores. This association is not 
statistically significant (p=.330). 

SOURCES: RFIS SAT 10 administration in the spring of 2006 and 2007, as well as from state/district education agencies 
in those sites that already used the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR): RFIS Instructional 
Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007; RFIS Global Appraisal 
of Teaching Strategies, fall 2005, spring 2006, fall 2006, and spring 2007 and RFIS Student Time-on-Task and 
Engagement with Print, fall 2005, spring 2006, fall 2006, and spring 2007 












Summary 



This chapter explored a number of hypotheses to explain the pattern of observed impacts. Analyses that 
explored the association between the length of implementation of Reading First in the study schools and 
reading comprehension scores, as well as between the number of years students had been exposed to 
Reading First instruction and reading comprehension scores were inconclusive. No statistically significant 
variation across sites in the pattern of impacts was found. Correlational analyses indicate a positive 
association between time spent on the five essential components of reading instruction promoted by the 
program and reading comprehension as measured by the SAT 10, but these findings appear to be sensitive 
to model specification and the sample used to estimate the relationship. 

The study finds that, on average, after several years of funding, the Reading First program has a 
consistent positive effect on reading instruction yet no statistically significant impact on student reading 
comprehension. Findings based on exploratory analyses do not provide consistent or systematic insight 
into the pattern of observed impacts. 








Appendix A: State and Site Award Data 



Appendix A presents additional information on when Reading First Impact Study sample sites first 
received Reading First awards (Exhibit A.1). 



Exhibit A.1: Award Date by Site in Order of Date when Reading First Funds Were First Made 
Available for Implementation 

            Date Initial Reading First    Date when Reading First Funds Were
Site        Award Was Announced           First Made Available for Implementation

Site 9      03/2003                       04/2003
Site 12     04/2003                       05/2003
Site 2      06/2003                       06/2003
Site 6      05/2003                       06/2003
Site 5      02/2003                       07/2003
Site 4      05/2003                       07/2003
Site 18     06/2003                       08/2003
Site 10*    10/2003                       08/2003
Site 11*    10/2003                       10/2003
Site 17*    08/2003                       12/2003
Site 14     01/2004                       02/2004
Site 8      01/2004                       03/2004
Site 3      03/2004                       04/2004
Site 13     01/2004                       04/2004
Site 15     10/2003                       05/2004
Site 1      05/2004                       06/2004
Site 7      05/2004                       06/2004
Site 16     03/2004                       08/2004

NOTE: 

Sites 10, 11, and 17 (marked with *) “backdated” the point at which schools could begin spending their grant 
money. It is not an error that the schools appear to have been given their money before their grants were 
announced. 

SOURCE: Reading First District Coordinators 



Final Report: State and Site Award Data 












Appendix B: Methods 



This appendix describes the general regression discontinuity approach used to estimate the impacts of 
Reading First and presents the specific models used to estimate impacts. In addition, it describes how the 
issue of multiple hypothesis testing was addressed and provides information about statistical precision. 

Part 1: Regression Discontinuity Design 

Approach 

The Reading First Impact Study is based on a regression discontinuity design that capitalizes on the 
systematic process used by a number of school districts to allocate their Reading First funds. 1 A 
regression discontinuity design is the strongest quasi-experimental method that exists for estimating 
program impacts. Under certain conditions (which are met by the present study) this method can approach 
the rigor of a randomized experiment. 2 The conditions include: 

1) Schools eligible for Reading First grants were rank-ordered for funding based on a 
quantitative rating, such as an indicator of past student reading performance or poverty. 

2) A cut-point in the rank-ordered priority list separated schools that did or did not receive 
Reading First grants, and this cut-point was set without knowing which schools would then 
receive funding. 

3) Funding decisions were based only on whether a school’s rating was above or below its local 
cut-point; nothing superseded these decisions; and further, 

4) The shape of the relationship between schools’ ratings and outcomes is correctly modeled. 

To see how the method works, consider a hypothetical school district that allocates its $2 million annual 
Reading First grant to 10 schools in equal allotments of $200,000 per year, per school. The district 
has also prioritized the schools with the highest rates of poverty, as measured by the percentage of 
students eligible for free or reduced-price meals. The district therefore awards grants first to the school 
with the highest poverty rate, then to the school with the next-highest poverty rate, and so on, until 10 
schools receive grants and all of the Reading First funding has been allocated. 
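
The allocation rule in this hypothetical district can be sketched in a few lines of Python. This is an illustration only: the function name, the school names, the poverty rates, and the choice to place the cut-point midway between the last funded and first unfunded school are all assumptions made for the example, not details taken from any study site.

```python
def allocate_grants(poverty_rates, total_budget=2_000_000, grant_size=200_000):
    """Fund schools in descending order of poverty rate until the budget is spent.

    poverty_rates: dict mapping school name -> percent of students eligible
    for free or reduced-price meals (the rating in the hypothetical example).
    """
    n_grants = total_budget // grant_size  # 10 grants of $200,000 each
    ranked = sorted(poverty_rates.items(), key=lambda kv: kv[1], reverse=True)
    funded = [name for name, _ in ranked[:n_grants]]
    not_funded = [name for name, _ in ranked[n_grants:]]
    cut_point = None
    if funded and not_funded:
        # Assumption for illustration: the cut-point sits midway between the
        # last funded school's rating and the first unfunded school's rating.
        cut_point = (ranked[n_grants - 1][1] + ranked[n_grants][1]) / 2
    return funded, not_funded, cut_point


# Hypothetical district with 14 eligible schools (ratings 95, 90, ..., 30).
rates = {f"School {i + 1}": 95.0 - 5.0 * i for i in range(14)}
funded, not_funded, cut_point = allocate_grants(rates)
print(len(funded), not_funded[0], cut_point)
```

Because funding depends only on which side of the cut-point a school's rating falls, the rating fully determines treatment status, which is the property the regression discontinuity analysis relies on.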

Exhibit B.1 illustrates how the dividing line, or “cut-point,” between the last funded school and the first 
school not funded on the district’s priority list (or between the 10th and 11th schools on this hypothetical 
district’s list) creates a “discontinuity” that makes it possible to estimate program impacts on future 
outcomes. The vertical axis of the exhibit represents a future outcome measure for each school, such as its 
average student reading score in a subsequent year. The horizontal axis represents the rating used to 
determine each school’s priority for Reading First (in this example, the percentage of past students 
eligible for free or reduced price meals). Schools to the left of the cut-point do not receive Reading First 
funding and serve as a “comparison group” for the impact analysis; these schools are referred to as 
non-Reading First schools. Schools to the right of the cut-point receive Reading First funding; these schools 
represent the “treatment group” for the impact analysis, and are referred to as Reading First schools. 



1 The Reading First Impact Study was originally planned as a randomized control study, in which eligible schools from a sample 
of districts were to receive Reading First funds or become members of a non-Reading First control group. The approach was 
not feasible, however, in the 38 states that had already begun to allocate their Reading First grants before the study began. 
Furthermore, in the remaining states, randomization was counter to the spirit of the Reading First Program, which strongly 
emphasizes serving the schools most in need. It was possible, however, to randomize schools in one site. 

2 Regression discontinuity analysis was introduced by Thistlethwaite and Campbell (1960) and has more recently experienced a 
resurgence of interest (e.g., Cappelleri et al., 1991; Cook, 2008; Goldberger, 1972; Hahn, Todd and Van der Klaauw, 2001; 
Mohr, 1995; Reichardt, Trochim, and Cappelleri, 1995; and Trochim, 1990). 



Final Report: Methods 



Exhibit B.1: Regression Discontinuity Analysis for a Hypothetical School District 




The exhibit illustrates a downward-sloping relationship between schools’ ratings and their future 
outcomes. This implies that schools with a higher proportion of past (and thus future) students who live in 
poverty will tend to have lower levels of future student achievement. In the absence of Reading First, 
average student achievement at non-Reading First schools would therefore tend to be higher than at 
Reading First schools. Consequently, the average outcome for non-Reading First schools most likely 
over-states what this average would have been for Reading First schools without the program (their 
“counterfactual”). Because of this, a simple comparison of average outcomes for Reading First schools 
and non-Reading First schools would understate the impact of Reading First. 

Given the way that schools were selected for Reading First, however, it is possible to obtain unbiased 
estimates of the program’s impacts on future outcomes by controlling statistically for the relationships 
that exist between school outcomes and ratings. (These relationships comprise the “regression” part of 
regression discontinuity analysis.) Intuitively, this analysis would proceed as follows. The first step is to 
fit a regression line through the data points for non-Reading First schools, as indicated by the solid line to 
the left of the cut-point in Exhibit B.1. The second step is to extrapolate the fitted line across the cut-point 
to predict what student achievement would have been for Reading First schools in the absence of the 
program. This is indicated by the dashed line in the exhibit. The third step is to fit a regression line 
through the data points for Reading First schools, as indicated by the solid line to the right of the 
cut-point. (For the purpose of this hypothetical example, the two fitted lines are assumed to have the same 
slope and are thus parallel, which simplifies the analysis but is not necessary.) The impact of Reading 
First thus can be measured by the vertical distance between the solid fitted line for Reading First schools 
(what actually happened in Reading First schools after the program was launched) and the dashed 
extrapolated line for Reading First schools (the counterfactual prediction of what would have happened in 
Reading First schools without the program). This distance is indicated by a two-sided arrow. 

In short, the analysis uses the observable discontinuity in the regression relationship to identify the impact 
of Reading First. The magnitude of the discontinuity indicates the magnitude of the impact. If the 
regression model has the correct shape for the data being modeled (for example, two parallel straight lines 
for Reading First and non-Reading First schools), the discontinuity provides an unbiased impact estimate. 
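
As a rough illustration of this logic, the sketch below simulates one hypothetical district and estimates the discontinuity with ordinary least squares. It assumes a single site, a common slope on both sides of the cut-point, and simulated data; the function name `rd_impact` and every number are invented for the example, and the study's own impact model (with site-specific rating coefficients and baseline covariates) is more elaborate.

```python
import numpy as np


def rd_impact(rating, outcome, cut_point):
    """Estimate a sharp regression discontinuity impact with a common slope.

    Regresses the outcome on an intercept, a funded indicator, and the rating
    centered at the cut-point; the indicator's coefficient is the estimated
    vertical gap between the two parallel fitted lines.
    """
    rating = np.asarray(rating, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    funded = (rating >= cut_point).astype(float)  # right of cut-point = funded
    design = np.column_stack([np.ones_like(rating), funded, rating - cut_point])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1]


# Simulated district: outcomes fall as the poverty rating rises, and funded
# schools (ratings at or above the cut-point) gain a true impact of 5 points.
rng = np.random.default_rng(0)
rating = np.linspace(40, 100, 40)
cut = 70.0
true_impact = 5.0
funded = rating >= cut
outcome = 120 - 0.5 * rating + true_impact * funded + rng.normal(0, 0.5, rating.size)

estimate = rd_impact(rating, outcome, cut)
naive = outcome[funded].mean() - outcome[~funded].mean()
print(f"RD estimate: {estimate:.2f}, naive difference in means: {naive:.2f}")
```

In this simulation the naive difference in means is negative even though the true impact is positive, mirroring the point that a simple comparison of averages would understate the program's impact when higher-poverty schools are the ones funded.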

The approach works properly if schools’ ratings are the only thing that determines their selection for 
Reading First. Consequently, only background characteristics that are correlated with ratings can be 
correlated with selection for the program. In other words, the only characteristics that can differ 
systematically between Reading First schools and non-Reading First schools are those correlated with 
their ratings. Controlling statistically for the ratings thereby controls for any systematic pre-existing 
differences between the two groups of schools. 3 It is this control that makes unbiased impact estimates 
possible. 

Seventeen of the 18 sites in the Reading First Impact Study (16 school districts and one state program) 
allocated their Reading First grants in ways that meet the requirements of a regression discontinuity 
design. Each site prioritized its eligible schools according to a specified quantitative indicator, in most 
cases, an indicator based on a measure of student poverty, student performance, or both. 4 (See Exhibit B.2 
for the criteria used by each site to rate its schools for Reading First.) Each site then allocated its Reading 
First funds according to the prioritized list, funding the top priority school first, the second priority school 
next, and so on through the list, until all available resources were allocated. In the context of this study, 
these sites are referred to as regression discontinuity design (RDD) sites. 

The study sample was drawn from Reading First schools and non-Reading First schools whose ratings 
were as close as possible to their sites’ local cut-point. 5 Half of the schools in the study sample are 
Reading First schools and half are non-Reading First schools. 6 Only 9 of the 248 sample schools from 
study sites had their rating-based Reading First funding status changed. Consequently, the study’s sites 
support what is called a “sharp” regression discontinuity analysis, which is the strongest form of the 
design. 7 



3 It is because regression discontinuity analysis utilizes “selection on observables” (i.e., values of the rating) that it can produce 
unbiased impact estimates (Cain, 1975). This feature is what distinguishes the approach from other quasi-experimental designs. 

4 A separate rating coefficient (in the impact estimation model) was specified for each site to account for differences in rating 
variables and cut-points. These differences enhance the generalizability of the present study because it comprises 17 regression 
discontinuity analyses from different parts of the United States. 

5 Note that the RDD can be compromised if there is little or no variation on the rating variable within treatment and comparison 
groups in a given site. As illustrated in Exhibit B.2, however, the schools selected for the study sample were both close to their 
local cutpoints and varied with respect to the rating variable. Therefore, this potential problem was not present in the study 
sample. 

6 These proportions were exact for the original study sample of 258 schools. With the subsequent loss of 10 schools, they remain 
almost exact. 

7 A sharp regression discontinuity analysis has very few cases where assignment to treatment or comparison status based on 
ratings is changed due to other considerations. A “fuzzy” regression discontinuity design has more such aberrant cases. A 
fuzzy regression discontinuity analysis is more complex and requires further assumptions (Shadish, Cook and Campbell, 
2002). 








Exhibit B.2: Numbers, Ratings, and Cut-points for Selection of Reading First and Reading First 
Impact Study Schools, by Site (Initial Sample for 17 Sites, Excluding Random Assignment 
Site) 

          Rating  No. of Schools  Ratings,              Number of Sample    Cut-    Number of Sample  Ratings,
Site      basis   Rated (Funded)  not-funded side       Schools Not Funded  point   Schools Funded    funded side

Site 8    1       199 (74)        33.0 ... 136.7        16                  144.9   16                148.3 ... 184.3
Site 3    2       31 (16)         25.3 ... 25.3         12                  30.5    12                37.3 ... 48.1
Site 7    2       44 (15)         36.4 ... 57.9         11                  70.2    11                79.7 ... 97.1
Site 14   1, 2    43 (23)         51.0 ... 88.0         11                  88      11                136 ... 174.0
Site 5    2, 4    58 (23)         1.0 ... 14.0          10                  18      10                22.0 ... 29.0
Site 2    2       56 (11)         90.0 ... 58.0         8                   52.5    8                 32.0 ... 23.0
Site 10   2       34 (16)         100.0 ... 95.0        8                   86      8                 78.0 ... 64.0
Site 9    2       30 (12)         46.0 ... 92.0         7                   136.5   7                 153.0 ... 177.0
Site 13   2       24 (7)          85.7 ... 93.5         7                   96.9    7                 99.7 ... 99.7
Site 11   2       19 (12)         100.0 ... 92.0        6                   86      6                 79.0 ... 69.0
Site 16   2       40 (24)         38.5 ... 62.2         6                   67.1    6                 75.2 ... 95.4
Site 4    2       11 (6)          40.5 ... 40.5         5                   50      5                 59.5 ... 67.4
Site 6    3       8 (4)           8.0 ... 8.0           4                   4.5     4                 1.0 ... 1.0
Site 15   3       8 (4)           8.0 ... 8.0           4                   4.5     4                 1.0 ... 1.0
Site 12   2       7 (4)           14.3 ... 14.3         3                   20.9    3                 28.0 ... 35.6
Site 17   2       23 (14)         100.0 ... 90.0        3                   85.5    3                 84 ... 67.0
Site 18   2, 4    21 (6)          215.0 ... 151.0       3                   144.5   3                 125.0 ... 101.0

Rating basis: 
1 Ratings based upon proposals 
2 Ratings based on student achievement and/or poverty 
3 Rankings based on student achievement and/or poverty 
4 Other 



NOTES: 

Ratings varied in directionality and metrics; in some sites, higher scores indicated greater needs; in other sites, lower scores 
indicated greater needs. 

EXHIBIT READS: Site 8 rated 199 schools, and funded 74 schools. The RFIS sample in Site 8 included 32 schools — 
16 non-Reading First schools and 16 Reading First schools — that were rated from 136.7 to 148.3, shown at the left 
and right sides of the shaded bar, respectively. The cut-point was at 144.9. The lowest school rating was 33, and the 
highest school rating was 184.3. 

SOURCES: Interviews with sites’ Reading First coordinators in 2004. 



In the 18th study site (a school district), it was possible to randomly assign a subset of its Reading First- 
eligible schools to receive or not receive Reading First funds. In this site, five candidate schools were 
assigned to Reading First and five were assigned to a control group. Hence, this site provides a group- 
randomized experiment. This site is referred to as the experimental site. 

Sample Size 

Although regression discontinuity analysis can provide unbiased impact estimates under the conditions 
met by this study, and thus is comparable to a true experiment in this regard, the quasi-experimental 
approach requires a much larger sample of schools to provide the same precision as an experiment 
because one must include the rating variable in any models to account for the design effect (Bloom, 
Kemple and Gamse, 2004). The study team conducted analyses of the effect of including the rating of 
schools as a covariate for a regression discontinuity analysis of program impacts. The team determined 
that if ratings are used as a covariate, the variance of the impact estimator for a regression discontinuity 
analysis will be four times that for a corresponding experiment. 8 Hence, to achieve the same minimum 
detectable effect the regression discontinuity analysis would need four times as many schools as the 
experiment. 

Based on these analyses and extensive discussions among members of the research team, IES staff, and 
the project’s technical work group, it was decided that a sample of roughly 240 schools was needed, 
which is four times the sample size planned for the original experimental design. This larger sample size 
was necessary for the study to achieve a minimum detectable effect size of 0.20 standard deviations. As 
noted above, initial recruitment efforts produced a sample of 258 schools from one state site and 17 
district sites. These 18 sites represent a total of 13 states. Due to refusals, school closings, reconfiguring, 
or redistricting, 10 schools (4 RF schools and 6 non-RF schools) subsequently dropped out of the study. 
For results presented in this report, a final analytic sample of 248 schools was used. (See Exhibit B.3 for a 
flowchart of sample selection from regression discontinuity design target sample to the final analytic 
sample.) 
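
The arithmetic behind the fourfold sample requirement can be written out as a short sketch. It assumes, as is standard, that the variance of an impact estimator is inversely proportional to the number of schools, so a fourfold variance penalty is offset by a fourfold sample. The function names are hypothetical, and the figure of 60 schools is only what the stated ratio implies (240 divided by 4), not a number reported in the text.

```python
import math


def schools_needed(experimental_schools, variance_ratio=4.0):
    """Schools a regression discontinuity analysis needs to match the precision
    of an experiment, assuming estimator variance scales as 1 / (number of schools)."""
    return math.ceil(experimental_schools * variance_ratio)


def mde_inflation(variance_ratio=4.0):
    """Factor by which the minimum detectable effect grows at a fixed sample size:
    the standard error, and hence the MDE, scales with the square root of variance."""
    return math.sqrt(variance_ratio)


implied_experimental_sample = 240 // 4  # implied by the "four times" statement
print(schools_needed(implied_experimental_sample))  # 240
print(mde_inflation())  # 2.0
```

Read the other way, a regression discontinuity analysis run on the experiment's planned sample would have a minimum detectable effect about twice as large, which is why the larger sample was needed to reach 0.20 standard deviations.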

Specification Tests 

In developing the study sample, Reading First schools and non-Reading First schools were selected to be 
as close as possible to their local cut-points for receipt of Reading First funding. This was done to yield 
two groups of schools that were as similar as possible. (See Exhibit B.4 for unadjusted baseline 
characteristics of schools in the study sample.) In addition, program impacts were estimated using a linear 
regression discontinuity model that controls for values of the ratings used to choose schools for program 
funding. Furthermore, estimates of impacts on measures of student reading achievement control explicitly 
for school-level baseline measures of reading achievement. This combination of sample design and 
statistical analysis was expected to provide internally valid estimates of program impacts. 

Three sets of specification tests were conducted to assess whether this expectation was met. 9 Although 
none of these tests by itself can prove that internal validity was achieved, in combination they provide 
evidence that this is most likely the case. The most important such test used a linear regression 
discontinuity model to compare baseline characteristics of Reading First schools and non-Reading First 
schools. If a linear regression discontinuity model is an appropriate way to control for all pre-existing 
differences between the two groups, observable or not, then it should eliminate their observed baseline 
differences. 
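
The idea of this baseline test can be illustrated with a small, deliberately noise-free example. The sketch assumes one site and a baseline characteristic that depends linearly on the rating with no jump at the cut-point; the function name and all values are made up for illustration and do not reproduce the study's specification tests.

```python
import numpy as np


def adjusted_baseline_difference(rating, baseline, cut_point):
    """Funded vs. not-funded gap in a baseline characteristic after controlling
    linearly for the rating (coefficient on the funded indicator)."""
    rating = np.asarray(rating, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    funded = (rating >= cut_point).astype(float)
    design = np.column_stack([np.ones_like(rating), funded, rating - cut_point])
    coef, *_ = np.linalg.lstsq(design, baseline, rcond=None)
    return coef[1]


# Hypothetical site: 20 schools with ratings 50, 52, ..., 88 and a cut-point of 70.
rating = np.arange(50.0, 90.0, 2.0)
cut = 70.0
baseline = 80.0 - 0.4 * rating  # linear in the rating, no discontinuity
funded = rating >= cut

raw_gap = baseline[funded].mean() - baseline[~funded].mean()
adjusted_gap = adjusted_baseline_difference(rating, baseline, cut)
print(f"raw gap: {raw_gap:.1f}, adjusted gap: {adjusted_gap:.3f}")
```

The raw gap is sizable because funded schools have systematically different ratings, while the rating-adjusted gap is essentially zero. A clearly nonzero adjusted gap in real baseline data would suggest that the linear model does not fully account for pre-existing differences.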

Baseline specification tests were conducted using aggregate school-level baseline characteristics. 10 The 
results of these tests in Exhibit B.5 show that none of the adjusted residual differences between Reading 
First schools and non-Reading First schools for the selected baseline characteristics were statistically 
significant. Hence, there is little evidence of residual differences in these school-level baseline 
characteristics. Results of these tests do not provide statistical evidence of substantial bias in impact 
estimates for the present report. Also, because impact estimates for student reading comprehension 
control explicitly for observed differences in school-level mean baseline test scores (typically the 
strongest predictor of future test scores), they provide further protection against bias. 



8 See the study’s Interim Report Appendix B, Part 5 (Gamse, Bloom, Kemple & Jacob, 2008) for details of these analyses. 

9 See the study’s Interim Report Appendix B (Gamse, Bloom, Kemple & Jacob, 2008) for a detailed presentation of the 
specification tests conducted to assess the study’s internal validity. 

10 Baseline data were available at the school level only. 








Exhibit B.3: RFIS Sample Selection: From Regression Discontinuity Design Target Sample 
to Analytic Sample 




*The final analytic sample includes 146 schools from 7 sites that have 8 or more RF schools (74 RF, 72 non-RF schools) 
and 102 schools from 6 sites that have between 3 and 7 RF schools (51 RF, 51 non-RF schools). 








Exhibit B.4: Observed Differences in Baseline Characteristics of Schools in the Study Sample: 
2002-2003 



                                      Actual Mean     Actual Mean for                 Statistical
                                      for Reading     Non-Reading                     Significance of
Characteristic                        First Schools   First Schools     Difference    Difference (p-value)

Students
  Male (%)                            52.3            51.6              0.7*          (0.049)
  Race (%)
    Asian                             3.1             3.3               -0.2          (0.670)
    Black                             35.6            33.9              1.7           (0.532)
    Hispanic                          26.7            22.5              4.1*          (0.021)
    White                             34.2            39.8              -5.6*         (0.006)
    American Indian/Alaskan           0.5             0.5               0.0           (0.847)
  Free Lunch and Reduced Lunch (%)    74.4            68.9              5.5*          (0.002)

Schools
  Eligible for Title 1 (%)            97.6            90.7              6.9*          (0.013)
  Locale (%)
    Large City                        39.2            37.4              1.8           (0.476)
    Mid-size City                     36.8            34.6              2.2           (0.434)
    Other a                           24.0            28.0              -4.0          (0.286)
  Size
    Total Number of Students          474.8           488.7             -13.9         (0.462)
    Number of Students in Grade 3     71.6            76.0              -4.4          (0.162)
    Student/Teacher Ratio             15.1            15.2              -0.1          (0.613)

Third Grade Reading Performance, Deviation from State RF Mean
  Proficiency Rate (%) b              -1.3            1.8               -3.0*         (0.019)



NOTES: 

The complete RF study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13 states. 125 
schools are Reading First schools and 123 are non-Reading First schools. 

A two-tailed test of significance was used, and where applicable, statistically significant findings at the p < .05 level are 
indicated by *. 

a Other Locale includes urban fringe of a large city, urban fringe of a mid-sized city, large town, small town, and rural. 
b A school’s proficiency score is defined as the percentage of third grade students (or fourth or fifth grade when third grade is 
unavailable) in the school that score at or above the state-defined proficiency threshold on the state’s reading assessment. 
The values in this row represent the average percentage point deviation from the mean proficiency score for the Reading 
First schools in the state. 

EXHIBIT READS: On average, 52.3 percent of students in Reading First schools and 51.6 percent of students in non- 
Reading First schools were male. The difference in the percent of male students between Reading First and non-Reading First 
schools was 0.7 percentage points. The difference was statistically significant at the p<.05 level (p=.049). 

SOURCES: Data on baseline characteristics are from the Common Core of Data. 











Exhibit B.5: Estimated Residual Differences in Baseline Characteristics of Schools in the Study 
Sample: 2002-2003 



                                                                      Statistical
                                      Estimated Residual              Significance of
Characteristic                        Difference                      Difference (p-value)

Students
  Male (%)                            0.9                             (0.246)
  Race (%)
    Asian                             0.9                             (0.363)
    Black                             -7.2                            (0.199)
    Hispanic                          3.3                             (0.345)
    White                             2.8                             (0.503)
    American Indian/Alaskan           0.2                             (0.182)
  Free Lunch and Reduced Lunch (%)    -6.0                            (0.073)

Schools
  Eligible for Title 1 (%)            -1.4                            (0.802)
  Locale (%)
    Large City                        4.3                             (0.419)
    Mid-size City                     9.1                             (0.108)
    Other a                           -13.4                           (0.083)
  Size
    Total Number of Students          -0.9                            (0.982)
    Number of Students in Grade 3     -3.8                            (0.558)
    Student/Teacher Ratio             0.1                             (0.861)

Third Grade Reading Performance, Deviation from State RF Mean
  Proficiency Rate (%) b              4.3                             (0.085)



NOTES: 

The complete RF study sample includes 248 schools from 18 sites (17 districts and 1 state) located in 13 states. 125 
schools are Reading First schools and 123 are non-Reading First schools. 

The “Estimated Residual Difference” is the adjusted residual difference between Reading First schools and non-Reading First 
schools estimated using the regression discontinuity model, which controls for each school’s rating. 

A two-tailed test of significance was used, and where applicable, statistically significant findings at the p<.05 level are 
indicated by *. 

a Other Locale includes urban fringe of a large city, urban fringe of a mid-sized city, large town, small town, and rural. 

b A school’s proficiency score is defined as the percentage of third grade students (or fourth or fifth grade when third grade is 
unavailable) in the school that score at or above the state-defined proficiency threshold on the state’s reading assessment. 
The values in this row represent the average percentage point deviation from the mean proficiency score for the Reading 
First schools in the state. 

EXHIBIT READS: The estimated residual difference on the percent of male students between Reading First and non- 
Reading First schools was 0.9 percentage points. The difference was not statistically significant at the p<.05 level 
(p=.246). 

SOURCES: Data on baseline characteristics are from the Common Core of Data. 











Part 2: Estimation Methods 



The statistical models used to estimate the impact of Reading First on the three major outcome domains 
(student reading achievement, classroom instruction, and student engagement with print), as well as on 
survey outcomes, shared most elements. However, because the models for reading achievement, for 
classroom instruction and student engagement with print, and for surveys differed in some respects, 
the approach for each is described separately below. 

Impact tables throughout the report and appendices contain: 1) the actual, unadjusted mean outcomes for 
Reading First schools in the study sample (“Actual Mean with Reading First”), 2) the best estimate of 
what would have happened in RF schools absent RF funding (“Estimated Mean without Reading First”), 
3) the impact estimate, 11 and 4) the effect size of the impact estimate. 12, 13 
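
The relationships among these four table entries can be illustrated with a minimal sketch. It follows the definitions given in the footnotes to this paragraph (the estimated mean without Reading First is the actual mean minus the impact estimate, and the effect size is the impact divided by the standard deviation of the outcome for non-Reading First schools). The function names and all numeric values are hypothetical and are not taken from the study's results.

```python
# Sketch only: function names and numbers are hypothetical illustrations.

def estimated_mean_without_rf(actual_mean_with_rf, impact):
    """Counterfactual mean: actual RF mean minus the impact estimate."""
    return actual_mean_with_rf - impact


def effect_size(impact, non_rf_sd):
    """Impact expressed in standard deviation units of the non-RF outcome."""
    return impact / non_rf_sd


actual_mean = 540.0   # hypothetical actual mean with Reading First
impact = 5.0          # hypothetical impact estimate
non_rf_sd = 50.0      # hypothetical standard deviation for non-RF schools

print(estimated_mean_without_rf(actual_mean, impact))  # 535.0
print(effect_size(impact, non_rf_sd))                  # 0.1
```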

Impact Estimation Method for Reading Achievement 

The statistical model used to estimate RF impacts on student reading comprehension and decoding is 
described by (1) below: 



\[
\begin{aligned}
Y_{ijkm} ={}& \sum_{m,t} \beta_{0m} S_{mk} YR_t + \sum_{m} \beta_{1m} S_{mk} T_k + \sum_{m,t} \beta_{2m} S_{mk} R_k YR_t + \sum_{m,t} \beta_{3m} S_{mk} \bar{Y}_{-1km} YR_t \\
&+ \sum_{t} \gamma Z_{jk} YR_t + \sum_{n,t} \alpha_{n} X_{nijkm} YR_t + \mu_k + \nu_{jk} + \varepsilon_{ijk}
\end{aligned}
\tag{1}
\]

where: 

$Y_{ijkm}$ = the post-test for student i from classroom j in school k in site m, 

$S_{mk}$ = one if school k is in site m and zero otherwise, m = 1 to 18, 

$T_k$ = one if school k is a treatment school and zero otherwise, 

$R_k$ = the rating for school k (standardized and centered by site), 

$\bar{Y}_{-1km}$ = the mean baseline pretest for school k (standardized and centered by site), 

$YR_t$ = indicator for follow-up years, 2005, 2006, or 2007, 14 

$Z_{jk}$ = a variable indicating when the post-test was given for classroom j in school k (site-centered), 

$X_{nijkm}$ = demographic characteristic n of student i from classroom j in school k, in site m, 

$\mu_k$, $\nu_{jk}$, and $\varepsilon_{ijk}$ = school-level, classroom-level, and student-level random error terms, 
respectively, assumed to be independently and identically distributed. 



11 The estimates of what would have happened in RF schools absent RF funding are calculated by subtracting the impact 
estimates from the RF schools’ actual mean values. 

12 The effect size of the impact is the impact divided by the actual standard deviation of the outcome for the non-Reading First 
Schools pooled across all years for which the outcome was available. 

13 When calculating the effect sizes, the standard deviation from the non-Reading First schools was used instead of the pooled 
standard deviation from Reading First and non-Reading First schools because the treatment could have affected the 
distribution, and hence the standard deviation, of the outcomes in Reading First schools but not in non-Reading First schools. 
The study team wanted to use a stable standard deviation, and non-Reading First schools provided that. It is also important to 
note that the standard deviations for the student outcomes obtained from non-Reading First schools are very close to those 
observed in the national norming sample. 

14 For decoding, this indicator is not used because only one year of data is available for this outcome. 








The average estimated value of $\beta_{1m}$ (m = 1, 2, ..., 18), weighted by the number of RF schools in each 
site, is the program impact for the average RF school in the study sample. 
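
The weighting described above can be sketched as follows. This is a minimal illustration, assuming three hypothetical sites with made-up site-specific impact estimates and RF school counts; the function name `pooled_impact` is not from the study.

```python
# Sketch only: site impacts and school counts are hypothetical.

def pooled_impact(site_impacts, rf_school_counts):
    """Average of site-specific impacts, weighted by the number of RF schools per site."""
    total_schools = sum(rf_school_counts)
    weighted_sum = sum(b * n for b, n in zip(site_impacts, rf_school_counts))
    return weighted_sum / total_schools


site_impacts = [6.0, 2.0, -1.0]   # hypothetical estimates of beta_1m for three sites
rf_school_counts = [10, 5, 5]     # hypothetical number of RF schools in each site

print(pooled_impact(site_impacts, rf_school_counts))  # 3.25
```

Because the weights are RF school counts, sites with more Reading First schools contribute proportionally more to the pooled estimate.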

The student achievement impact model (Equation 1) used to estimate impacts on reading comprehension 
and decoding has the following characteristics: 

• It is a multi-level model that reflects the nested structure of the data by accounting for three 
levels of clustering in the estimation of standard errors: clustering of students within 
classrooms, classrooms within schools, and schools within study sites. 

• Baseline covariates are added to the model to improve precision. These covariates include 
student gender, student age at start of school year, 15 date of the post-test at the classroom 
level, and a school-level pre-program reading performance measure. 16, 17 

• The rating variable was not included in the model for the one site that assigned schools to 
Reading First and non-Reading First groups randomly. 

• In estimating reading comprehension pooled impacts for the combined sample from 2005, 
2006 and 2007, the covariates for site, rating, pretest, test date, and demographic 
characteristics were interacted with an indicator for follow-up year (2005, 2006 or 2007). 

• In estimating decoding impacts, the covariates for site, rating, pretest, test date, and 
demographic characteristics were not interacted with an indicator for follow-up year because 
there is only one year of data for this outcome. 

Impacts on Classroom Instruction and Student Engagement with Print 

The impacts of Reading First on classroom instruction and student engagement with print were estimated 
using the following three-level model (with observations at level one, classrooms at level two, and 
schools at level three): 



\[
Y_{ijkm} = \sum_{m,t} \beta_{0m} S_{mk} YR_t + \sum_{m} \beta_{1m} S_{mk} T_k + \sum_{m,t} \beta_{2m} S_{mk} R_k YR_t + \mu_k + \nu_{jk} + \varepsilon_{ijk}
\tag{2}
\]

Where: 

$Y_{ijkm}$ = the outcome measure for observation i from classroom j in school k in site m, 

$S_{mk}$ = one if school k is in site m and zero otherwise, (m = 1, 2, ..., 18), 

$T_k$ = one if school k is a treatment school and zero otherwise, 

$R_k$ = the rating for school k (standardized and centered by site), 

$YR_t$ = indicator for follow-up years, 2005, 2006, or 2007, 18 



15 Age at start of the school year is each student’s age as of September 1 of the given year. For example, age as of September 1, 
2005 for the 2005-2006 school year. 

16 Different pre-program performance measures were constructed for early and late award sites. For the 10 early award sites and 
one late award site (which had no fall 2004 test data due to a hurricane), performance on a state reading test (when available, 
we used an average of test scores from up to three pre-RF years) was used as a school level pretest measure. For late award 
sites except for the one without available fall 2004 data, the mean fall 2004 SAT 10 test scores for each school/grade were 
used as the pretest measure. 

17 As a robustness test, the analysis was conducted without some or all of these additional covariates and the impact estimates 
stayed virtually unchanged. Results for these additional tests are available upon request. 

18 For the STEP, only two year indicators are included in the model, since STEP data was not collected in the first year. 








$\mu_k$, $\nu_{jk}$, and $\varepsilon_{ijk}$ = school-level, classroom-level, and observation-level random error terms, 
respectively, assumed to be independently and identically distributed. 

The impact estimate is the average estimated value of $\beta_{1m}$ (m = 1, 2, ..., 18), weighted by the number of 
treatment schools in each site. 



The impact estimation model for classroom instruction and student engagement with print described by 
(Equation 2) has the following characteristics: 



• It is a multi-level model that reflects the nested structure of the data by accounting for three 
levels of clustering in the estimation of standard errors: clustering of observation days within 
classrooms, classrooms within schools, and schools within sites. 

• A rating variable was not included in the model for the one site that assigned schools to 
Reading First and non-Reading First groups randomly. 

• In estimating pooled impacts for the combined sample from 2005, 2006 and 2007, the 
covariates for site and rating were interacted with an indicator for follow-up year (2005, 
2006, or 2007). 

Estimation Method for Surveys 

Data from self-report surveys of teachers, reading coaches, and principals were used to estimate the 
impact of Reading First on the key components of scientifically based reading instruction. 19 Two models 
were needed to estimate differences for survey data — one for classroom level data (i.e., from teacher 
survey) and a second for school level data (i.e., from reading coach survey or principal survey). 
Differences for classroom level survey data were estimated using the following two-level model (with 
classrooms at level one and schools at level two): 



\[
Y_{jkm} = \sum_{m} \beta_{0m} S_{mk} + \sum_{m} \beta_{1m} S_{mk} T_k + \sum_{m} \beta_{2m} S_{mk} R_k + \nu_k + \varepsilon_{jk}
\tag{3}
\]

Where: 

$Y_{jkm}$ = the outcome measure for classroom j in school k in site m, 20 

$S_{mk}$ = one if school k is in site m and zero otherwise, (m = 1, 2, ..., 18), 

$T_k$ = one if school k is a treatment school and zero otherwise, 

$R_k$ = the rating for school k (standardized and centered by site), 

$\nu_k$ and $\varepsilon_{jk}$ = school-level and classroom-level random error terms, respectively, assumed to be 
independently and identically distributed. 



The difference estimate is the average estimated value of $\beta_{1m}$ (m = 1, 2, ..., 18), weighted by the number of 
treatment schools in each site. 



The impact estimation model for classroom level survey data described by (Equation 3) has the following 
characteristics: 



19 Only 2007 survey data is included in these analyses due to low survey response rates in 2005. 

20 To maintain parallel structure with other estimation models presented in this appendix, the nomenclature for classroom (j), 
school (k), and site (m) remains the same even in the absence of student or observation level survey data (i). 








• It is a multi-level model that reflects the nested structure of the data by accounting for two 
levels of clustering in the estimation of standard errors: clustering of classrooms within 
schools and schools within sites. 

• A rating variable was not included in the model for the one site that assigned schools to 
Reading First and non-Reading First groups randomly. 

• Only one year of survey data was used, so no interactions with the follow-up year 
were included in the estimation model. 

Differences for school level survey data were estimated using the following ordinary least squares 
regression model: 



\[
Y_{km} = \sum_{m} \beta_{0m} S_{mk} + \sum_{m} \beta_{1m} S_{mk} T_k + \sum_{m} \beta_{2m} S_{mk} R_k + \varepsilon_k
\tag{4}
\]

Where: 

$Y_{km}$ = the outcome measure for school k in site m, 21 

$S_{mk}$ = one if school k is in site m and zero otherwise, (m = 1, 2, ..., 18), 

$T_k$ = one if school k is a treatment school and zero otherwise, 

$R_k$ = the rating for school k (standardized and centered by site), 

$\varepsilon_k$ = school-level random error term assumed to be independently and identically distributed. 



The difference estimate is the average estimated value of $\beta_{1m}$ (m = 1, 2, ..., 18), weighted by the number of 
treatment schools in each site. 



The impact estimation model for school level survey data described by (Equation 4) has the following 
characteristics: 



• It is a single-level ordinary least squares regression model that accounts for one level of 
clustering in the estimation of standard errors: clustering of schools within sites. 

• A rating variable was not included in the model for the one site that assigned schools to 
Reading First and non-Reading First groups randomly. 

• Only one year of survey data was used, so no interactions with the follow-up year 
were included in the estimation model. 

Part 3: Approach to Multiple Hypothesis Testing 



This section addresses the issue of multiple hypothesis testing. It first summarizes the five core principles 
that were used as a guide for addressing the issue in the current study, and then describes a two-stage 
approach for operationalizing these principles. 



21 To maintain parallel structure with other estimation models presented in this appendix, the nomenclature for school (k) and site 
(m) remains the same even in the absence of student or observation level survey data (i) or classroom level survey data (j). 








Principle #1: Qualify tests instead of adjusting them. The present analysis qualifies specific hypothesis 
tests using composite tests of pooled hypotheses rather than (1) adjusting significance levels (through 
Bonferroni methods) or (2) adjusting significance thresholds (through Benjamini and Hochberg methods) 
of specific tests. 

Principle #2: Address multiple testing differently for the central research questions of the study and for 
supplemental analyses. The analysis specifies two tiers of hypotheses: Tier 1 comprises a small number 
of hypotheses about the central research questions of the study, and Tier 2 represents supplemental 
research questions. Multiple testing is treated separately and differently within the two tiers. Statistical 
tests of Tier 1 hypotheses are considered confirmatory. To address the issue of multiplicity within Tier 1, 
the present study tested a reduced set of outcomes by conducting pooled tests of composite hypotheses, 
each of which represents a set of hypotheses that have been tested separately. The set of Tier 2 hypothesis 
tests is allowed to be much larger and less confirmatory. It may or may not be necessary to qualify these 
findings for multiple testing since they are not confirmatory. 

Principle #3: Delineate separate domains that reflect key clusters of constructs represented by the 
central research questions of a study. Domains comprise broad clusters of outcome constructs that can 
contain multiple measures, subgroups, or follow-up observations. Domains are defined conceptually, and 
do not provide narrow “silos” for collecting findings. The central domains for the present study are 
student reading achievement, classroom reading instruction, and student engagement with print. In 
addition, survey data is a domain for exploratory analyses of support for scientifically based reading 
instruction across study schools. 

Principle #4: Report analyses to address multiple comparisons in the background of research reports, 
not in the foreground. For the present study, references to the qualifying tests occur in the main text but 
not in tables. 

Principle #5: Use tests for interactions as a composite test (and thus a guide) for focusing on subgroup 
findings. 22 

Based on the above five principles, the present study uses the following two-stage approach to address 
multiple hypothesis testing. The first stage involves prioritizing outcomes and subgroups for the impact 
analysis. The second stage encompasses strategies for conducting composite tests on pooled key 
outcomes. The core features of each stage are described below. 

Stage 1: Creating a Parsimonious List of Outcomes and Subgroups and 
Prioritizing Key Outcomes 

The first stage of the framework involves a process of carefully categorizing and prioritizing the 
outcomes and subgroups for the impact analysis. The goal of this exercise is to create the shortest possible 
list of outcomes and subgroups that reflect the most proximal and policy relevant indicators of Reading 
First’s effectiveness. Analytically, the shorter the list, the less likely it is that one would attribute 
statistical significance to an impact that did not truly occur. These outcomes and subgroups were selected 
within distinct measurement domains to correspond to key components of the program’s theory of action 
and the key research questions posed by the program’s evaluation. 

22 If differences between impacts for subgroups are not statistically significant, then individual subgroup results should be 
interpreted with caution. 

The impact analysis focuses on two components of the Reading First theory of action: 1) aligning 
teachers’ instructional practices and behaviors with the five dimensions of reading instruction, 23 and 2) 
improving students’ reading achievement. 24 The highest priority outcomes within each of these 
measurement domains would constitute “Tier 1” outcomes for the impact analysis. 

Recognizing that a short list of outcomes will almost certainly exclude important policy-relevant 
indicators of Reading First’s effectiveness (a form of Type II error), this first stage of the framework also 
includes the development of a secondary, or “Tier 2,” list of outcomes and subgroups. As discussed 
below, the present study treats Tier 1 and Tier 2 outcomes and their accompanying subgroups separately, 
and potentially differently, if or when making adjustments to the standards used for judging statistical 
significance. 

Exhibit B.6 provides a list of the Tier 1 and Tier 2 outcomes defined for each measurement domain for 
this report. Also displayed are the grade levels and follow-up periods on which the impact analyses focus. 

Stage 2: Conducting Composite Tests to Qualify Specific Hypothesis Tests 

One approach to qualifying multiple hypothesis tests is to test whether the overall effect of treatment on a 
family of outcomes is significantly different from zero. For example, a policy maker may be interested in 
the effect of an intervention on test scores in general, rather than on each subject separately. Measurement 
of such overall effects has its roots in the literature on clinical trials and on meta-analysis (O’Brien, 1984; 
Logan and Tamhane, 2002; and Hedges and Olkin, 1985). The present analysis constructs summary 
indices that aggregate information over multiple treatment effect estimates within each domain for Tier 1 
outcomes, as well as for survey constructs in Tier 2. See Exhibit B.7. 

Reading Comprehension 

To qualify the impact estimates for each outcome measure for each grade in the reading comprehension 
domain, the present analysis ran a composite regression that pooled the sample across grades 1, 2, and 3 
and two measures: scaled scores and an indicator of whether or not a student scored at or above grade 



23 The RFIS observational instrument, the IPRI, focused primarily on teacher behaviors. In order to ensure that the study also 
collected some data on student behaviors during observed reading instruction, the RFIS team developed the Student Time-on- 
Task Engagement with Print (STEP) instrument. Because student engagement with print is an outcome that is distinct from the 
student reading comprehension or classroom reading instruction domains, it is treated separately. 

24 The Reading First theory of action also includes allocating additional resources for districts and schools to purchase reading 
curricula, materials, and assessments; exposing teachers to professional development and coaching focused on the five 
dimensions of effective reading programs; and holding districts and schools accountable for improved reading achievement. 
The present study was able to measure the impact of Reading First on some of these other elements using survey data. 






Exhibit B.6: Outcome Tiers for the Reading First Impact Analysis 



Tier / Domain / Outcome
                                Impacts Estimate
      Year                      Grade                        Sample                     Variation

Tier 1

  Domain: Student Achievement
    Outcome: Reading Comprehension Scaled Score (SAT 10)
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2, 3   Full                       N/A
    Outcome: % At or Above Grade Level (SAT 10)
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2, 3   Full                       N/A
    Outcome: TOSWRF Standard Score
      2007                      Separate for Grade 1         Full                       N/A

  Domain: Instruction
    Outcome: Time on Five Dimensions
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full                       N/A
    Outcome: Highly Explicit Instruction
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full                       N/A
    Outcome: High Quality Practice
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full                       N/A

  Domain: Student Engagement with Print
    Outcome: % Students Engaged with Print
      2006, 2007 Pooled         Separate for Grade 1, 2      Full                       N/A

Tier 2

  Domain: Student Achievement
    Outcome: Reading Comprehension Scaled Score (SAT 10)
      2005                      Separate for Grade 1, 2, 3   Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2006                      Separate for Grade 1, 2, 3   Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2007                      Separate for Grade 1, 2, 3   Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2, 3   Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2007                      Separate for Grade 3         2005/2007 Stayer Subgroup  By Student Exposure
    Outcome: % At or Above Grade Level (SAT 10)
      2005                      Separate for Grade 1, 2, 3   Full                       N/A
      2006                      Separate for Grade 1, 2, 3   Full                       N/A
      2007                      Separate for Grade 1, 2, 3   Full                       N/A
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2, 3   Full, Award Subgroup       N/A

  Domain: Instruction
    Outcome: Time on Five Dimensions (Combined and for Five Dimensions separately)
      2005                      Separate for Grade 1, 2      Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2006                      Separate for Grade 1, 2      Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2007                      Separate for Grade 1, 2      Full, Award Subgroup       Over time; across sites; by Award Subgroup
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full, Award Subgroup       Over time; across sites; by Award Subgroup








Exhibit B.6: Outcome Tiers for the Reading First Impact Analysis (continued) 



Tier / Domain / Outcome
                                Impacts Estimate
      Year                      Grade                        Sample                     Variation

Tier 2

  Domain: Instruction
    Outcome: Highly Explicit Instruction
      2005                      Separate for Grade 1, 2      Full
      2006                      Separate for Grade 1, 2      Full
      2007                      Separate for Grade 1, 2      Full
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full, Award Subgroup
    Outcome: High Quality Practice
      2005                      Separate for Grade 1, 2      Full
      2006                      Separate for Grade 1, 2      Full
      2007                      Separate for Grade 1, 2      Full
      2005, 2006, 2007 Pooled   Separate for Grade 1, 2      Full, Award Subgroup

  Domain: Student Engagement with Print
    Outcome: % Students Engaged with Print
      2006                      Separate for Grade 1, 2      Full
      2007                      Separate for Grade 1, 2      Full

  Domain: Professional Development in SBRI
    Outcome: Amount of PD in reading received by teachers
      2007                      Grades 1, 2, 3 Pooled        Full
    Outcome: Teacher receipt of PD in the five essential components of reading instruction
      2007                      Grades 1, 2, 3 Pooled        Full
    Outcome: Teacher receipt of coaching
      2007                      Grades 1, 2, 3 Pooled        Full
    Outcome: Amount of time dedicated to serving as K-3 reading coach
      2007                      Grades 1, 2, 3 Pooled        Full






Exhibit B.6: Outcome Tiers for the Reading First Impact Analysis (continued) 



Tier / Domain / Outcome
                                Impacts Estimate
      Year                      Grade                        Sample                     Variation

Tier 2

  Domain: Amount of Reading Instruction
    Outcome: Minutes spent on reading instruction per day
      2007                      Grades 1, 2, 3 Pooled        Full                       N/A

  Domain: Supports for Struggling Readers
    Outcome: Availability of differentiated instructional materials for struggling readers
      2007                      Grades 1, 2, 3 Pooled        Full                       N/A
    Outcome: Provision of extra classroom practice for struggling readers
      2007                      Grades 1, 2, 3 Pooled        Full                       N/A

  Domain: Use of Assessments
    Outcome: Use of assessments to inform classroom practice
      2007                      Grades 1, 2, 3 Pooled        Full                       N/A







Exhibit B.7: Summary of Impacts and Results of Composite Tests 


Outcome Measure                                    Impact (p-value)
                                                   Grade 1            Grade 2            Grade 3

Reading Comprehension
  Standard scaled score                            4.74 (p=0.083)     1.69 (p=0.462)     0.30 (p=0.887)
  Percent reading at or above grade level          4.22 (p=0.104)     1.60 (p=0.504)     -0.08 (p=0.973)
  Result of composite test: p=0.957 for composite test across 3 grades and 2 outcomes

Instruction
  Minutes of instruction in 5 reading dimensions   6.92* (p=0.005)    9.79* (p<0.001)    --
  Highly explicit instruction                      3.29* (p=0.018)    3.00* (p=0.040)    --
  High quality student practice                    0.82 (p=0.513)     2.94* (p=0.019)    --
  Result of composite test: p<0.001 for composite test across 2 grades and 3 outcomes

Student Engagement with Print
  Percent of students engaged with print           5.33 (p=0.070)     -4.75 (p=0.104)
  Result of composite test: p=0.845 for composite test across 2 grades and 1 outcome

Key Components of Scientifically Based Reading Instruction at the School-Level
                                                                                    Impact (p-value)
  Amount of time dedicated to serving as K-3 reading coach                          33.49* (p<0.001)
  Availability of differentiated instructional materials for struggling readers    0.01 (p=0.661)
  Result of composite test: p=0.009 for composite test across 2 outcomes

Key Components of Scientifically Based Reading Instruction at the Classroom-Level (aggregated across 
all three grade levels)
                                                                                    Impact (p-value)
  Amount of PD in reading received by teachers                                      12.13* (p<0.001)
  Teacher receipt of PD in the five essential components of reading instruction    0.55* (p=0.010)
  Teacher receipt of coaching                                                       0.20* (p<0.001)
  Minutes of reading instruction per day                                            18.47 (p<0.001)
  Provision of extra classroom practice for struggling readers                     0.19 (p=0.018)
  Use of assessments to inform classroom practice                                   0.18 (p=0.090)
  Result of composite test: p<0.001 for composite test across 3 grades and 6 outcomes





NOTES: 



Impact estimates are statistically adjusted (e.g., take each school's rating, site-specific funding cut-point, and other covariates 
into account) to reflect the regression discontinuity design of the study. 

EXHIBIT READS: The result of the composite test for reading comprehension test scores, across three grades and 
two outcomes, is not statistically significant (p=.957). 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007, as well as data from state/district education 
agencies in those sites that already use the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR); RFIS Instructional 
Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, fall 2006, and spring 2007; RFIS Student Time-on-Task 
and Engagement with Print, fall 2005, spring 2006, fall 2006, and spring 2007; and RFIS Survey administration, spring 
2007. 









level. 25 To qualify the six hypothesis tests for these outcomes, the RFIS team created one 
parsimonious index. The aggregation improves statistical power to detect effects that go in the same 
direction within a domain. The summary index is defined to be the equally weighted average of z-score 
outcome components, with the sign of each measure oriented so that more beneficial outcomes have 
higher scores. 26 

Specifically, the present analysis took the following steps in creating a composite index and conducting 
the analysis: 27 

1. First, z-scores were created for each outcome component in the reading comprehension 
domain by subtracting the unadjusted non-RF mean (pooled across years and grade levels) 
and dividing by its standard deviation (pooled across years and grade levels). Thus, each 
component of the index has a mean of zero and a standard deviation of one for the non-RF 
group. 

2. If an observation unit has a valid response to at least one component measure of the index, 
then any missing values of other component measures are imputed as the random assignment 
group mean. This results in differences between RF and non-RF means of an index being the 
same as the average of those two groups’ means of the components of that index (when the 
components are divided by their comparison group standard deviation and have no missing 
value imputation), so that the index can be interpreted as the average of results for separate 
measures scaled in standard deviation units. 28 

3. The z-scores from each component were averaged to obtain the index and an impact analysis 
was run on this index using a sample that pooled all years and all grade levels together. 
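
The index construction in the steps above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the toy data, and the use of the population standard deviation (`pstdev`) are assumptions, and the regression run on the index is not shown.

```python
from statistics import mean, pstdev


def summary_index(components, in_comparison_group, signs=None):
    """Equally weighted average of component z-scores.

    components: dict mapping component name -> list of values (one per unit)
    in_comparison_group: list of bools, True for non-RF units
    signs: optional dict of +1/-1 so that higher always means more beneficial
    """
    signs = signs or {}
    z_by_component = []
    for name, values in components.items():
        # Standardize against the unadjusted non-RF (comparison) mean and SD.
        ref = [v for v, c in zip(values, in_comparison_group) if c]
        mu, sd = mean(ref), pstdev(ref)
        sign = signs.get(name, 1)
        z_by_component.append([sign * (v - mu) / sd for v in values])
    # Average the z-scores across components, unit by unit.
    return [mean(unit_z) for unit_z in zip(*z_by_component)]


# Hypothetical toy data: two outcome components for six units.
scores = {
    "scaled_score": [600.0, 620.0, 580.0, 610.0, 590.0, 630.0],
    "at_grade_level": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
}
non_rf = [False, False, False, True, True, True]
index = summary_index(scores, non_rf)
```

By construction, each component (and therefore the index) averages zero in the non-RF group, so the RF group's mean index reads directly as an average difference in standard deviation units.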

This regression addresses the question whether overall the program “worked” in terms of improving 
student achievement. This result serves as a “qualifier” to the small number of specific hypothesis tests 
shown in impact tables. 29 



25 Although decoding is considered to be in the same domain as comprehension, it was not possible to include the TOSWRF 
scores in the composite because scores are available for only one grade in one year. 

26 An alternative is to use seemingly unrelated regression effects for specific outcomes to estimate the covariance of the effects 
and then to calculate the mean effect size for groups of estimates in a second step. The average z-score index approach is much 
simpler to work with. The two approaches yield identical treatment effects when there is no item nonresponse and no 
regression adjustment (Kling, Liebman, and Katz, 2007). 

27 The discussion and method presented here draw from Kling, Liebman, and Katz (2007). 

28 No data imputation was done in constructing the reading achievement composite index. 

29 Though decoding is included in the student achievement domain, it is not possible to include this outcome measure in the 
summary index with the reading comprehension outcomes. As noted earlier, the decoding test was administered in only one 
grade for one year. In addition, 2,158 students who were administered the reading comprehension test in that year and grade 
were not given the decoding test. Therefore, attempts to calculate a common index for the decoding and reading 
comprehension measures would have required collapsing the reading comprehension data to a single year and grade and 
imputing missing decoding scores. This approach was not taken because this process would have resulted in a significant loss 
of statistical power and would have weakened the usefulness of the index as a qualifier for reading comprehension impacts. 








Classroom Instruction 



A similar composite analysis was conducted for the instructional domain. To qualify the impact estimates 
for each outcome measure for each grade in the instructional domain, the analysis ran a composite 
regression which pooled the sample across grades and used an index constructed from z-scores for all 
three instructional outcome measures as the dependent variable. The index of instruction averaged 
together minutes in the five dimensions of reading instruction, percentage of highly explicit instruction, 
and percentage of high quality student practice. 30 

The results from this analysis help to answer the research question whether overall the Reading First 
program has an impact on instructional practice. 

In addition, program impacts for time spent on each of the five dimensions will be reported separately. 
Since the impact on total time spent on the five dimensions will already have been reported, any 
additional qualifying test is not necessary for these analyses. 

Student Engagement with Print 

A similar composite analysis was conducted for the student engagement with print outcome domain. For 
this domain impacts are reported for the full sample in grades 1 and 2 as the percentage of students 
engaged with print. To qualify the two multiple hypotheses tests for these outcomes, the RFIS Team 
reports the result from a composite regression which pools two grades together and represents the 
outcome measure in one parsimonious index, created in the same way that the composite index for 
reading comprehension and instruction was created. 31 This regression addresses the question whether 
overall the program “worked” in terms of having an impact on the percentage of students engaged with 
print. This result serves as a “qualifier” to the small number of specific hypothesis tests shown in impact 
tables. 

Implementation of Key Components of Scientifically Based Reading Instruction 
(Surveys) 

Because survey data was collected at the school level (i.e., reading coach and principal surveys) and at the 
classroom level (i.e., teacher surveys), two composite tests across domains were conducted, one at the 
school level and one at the classroom level. To qualify the impact estimates for each school level survey 
outcome measure, the analysis team ran a composite regression where the dependent variable was a single 
index created from z-scores for each of the two school level outcome variables. The index of key 
components of scientifically based research instruction at the school level averaged together the following 



30 No data imputation was done in constructing the reading instruction composite index. 

31 No data imputation was done in constructing the student engagement with print composite index. 








two survey outcomes: amount of time dedicated to serving as K-3 reading coach and availability of 
differentiated instructional materials for struggling readers. 32 

To qualify the impact estimates for each classroom level survey outcome measure, the analysis team ran a 
composite regression where the dependent variable was a single index created from z-scores for each of 
the six classroom level outcome variables. The index of key components of scientifically based research 
instruction at the classroom level averaged together the following six survey outcomes: amount of PD in 
reading received by teachers, teacher receipt of PD in the five essential components of reading 
instruction, teacher receipt of coaching, minutes of reading instruction per day, provision of extra 
classroom practice for struggling readers, and use of assessments to inform classroom practice. 33 

These regression analyses address the question whether overall the program “worked” in terms of having 
an effect on the implementation of key components of scientifically based reading instruction, in the 
school and in the classroom. These results serve as a “qualifier” to the small number of specific 
hypothesis tests shown in impact tables. 
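
The missing-value rule that the footnotes describe for the survey indices (keep a unit if it has at least one valid component, then fill each missing component with its group's mean) might look like the sketch below. The function names and toy values are hypothetical, and `None` stands in for a missing response.

```python
from statistics import mean


def units_to_keep(components):
    """True for units with a valid response on at least one component."""
    return [any(v is not None for v in unit) for unit in zip(*components.values())]


def impute_group_mean(values, groups):
    """Replace None with the mean of observed values in the same group."""
    group_means = {}
    for g in set(groups):
        observed = [v for v, gg in zip(values, groups) if gg == g and v is not None]
        group_means[g] = mean(observed)  # raises if a group has no observed values
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]


# Hypothetical school-level data: percent of time spent as reading coach.
groups = ["RF", "RF", "non-RF", "non-RF"]
coach_time = [100.0, None, 50.0, 0.0]
filled = impute_group_mean(coach_time, groups)
keep = units_to_keep({"coach_time": coach_time, "materials": [1, 1, None, 0]})
```

Filling with the group mean leaves each group's mean on that component unchanged, which is why the index difference still equals the average of the component differences.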

Part 4: Statistical Precision 

The statistical precision of an impact estimator is its ability to detect true intervention effects when they 
exist. A common way to represent statistical precision is a minimum detectable effect. This measure 
indicates the smallest true effect that an estimator has a “good chance” of detecting. The current analysis 
uses the common convention of defining a minimum detectable effect as the smallest true program effect 
(impact) that has an 80 percent chance of being found to be statistically significant (i.e., it has 80 percent 
statistical power) at the 0.05 level of statistical significance for a two-tailed test of the null hypothesis of 
no effect. When a minimum detectable effect is expressed as a standardized effect size (in standard 
deviation units), it is usually referred to as a minimum detectable effect size (MDE). 

Exhibit B.8 lists the minimum detectable effect (or effect size) for full-sample estimates of program 
impacts on key study outcomes when the data are pooled across the school years for which data are 
available. These minimum detectable effects are based on the experience of students and schools in the 
study sample during the follow-up period, and not on the initial assumptions that guided the study design. 
Hence, the findings in Exhibit B.8 represent the actual precision of the present design as it materialized in 
the field. 34 
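
As a rough check on this convention, the multiplier that converts a standard error into a minimum detectable effect can be computed from the normal distribution, which is a reasonable approximation when the degrees of freedom are large. The sketch below is illustrative only; the function names and the example standard error and standard deviation are assumptions, not study values.

```python
from statistics import NormalDist


def mde_multiplier(alpha=0.05, power=0.80):
    """Two-tailed critical value plus the quantile for the desired power."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def minimum_detectable_effect(standard_error, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power, in outcome units."""
    return mde_multiplier(alpha, power) * standard_error


def minimum_detectable_effect_size(standard_error, comparison_sd):
    """The same quantity expressed in comparison-group standard deviation units."""
    return minimum_detectable_effect(standard_error) / comparison_sd


# Hypothetical inputs: a standard error of 2.5 points and an SD of 50 points.
multiplier = mde_multiplier()
mde = minimum_detectable_effect(2.5)
mdes = minimum_detectable_effect_size(2.5, 50.0)
```

With alpha of 0.05 and 80 percent power, the multiplier is approximately 1.96 + 0.84, or about 2.8.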



32 Six schools were dropped from these analyses because both survey outcomes were missing. All other schools had a valid 
response on at least one of the survey outcomes. Missing values on individual survey outcomes ranged from 0% to 7% overall 
(RF: 0% to 1% and non-RF: 0% to 14%). Missing values were imputed as the random assignment group mean. 

33 All classrooms had a valid response on at least one of the six survey outcomes. Missing values on individual survey outcomes 
ranged from 0% to 3% overall (RF: 0% to 2% and non-RF: 0% to 5%). Missing values were imputed as the random 
assignment group mean. 

34 Because for the present full sample the number of degrees of freedom for estimating the standard error of an impact estimator 
is well beyond 30, the minimum detectable effect of an estimator equals 2.8 times its standard error. For further discussion see 
Bloom, H. S. (1995) “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs,” 
Evaluation Review, Vol. 19, No. 5, pp. 547-556. 








Exhibit B.8: Minimal Detectable Effects for Full Sample Impact Estimates 

                                                              Grade Level
                                                      Grade 1   Grade 2   Grade 3

Panel 1: Student Achievement
  Reading Comprehension
    Mean Scaled Score                                   7.62      6.41      5.93
    Effect Size                                         0.16      0.15      0.15
    Percent at or above Grade Level                     7.22      6.67      6.70
  Decoding
    Mean Standard Score                                 3.14       --        --
    Effect Size                                         0.21       --        --

Panel 2: Instructional Outcomes
  Instruction in the Five Dimensions Combined
    Minutes                                             6.84      6.88       --
    Effect Size                                         0.33      0.32       --
  Percentage of Intervals in Five Dimensions with
  Highly Explicit Instruction
    Percentage                                          3.88      4.06       --
    Effect Size                                         0.22      0.21       --
  Percentage of Intervals in Five Dimensions with
  High Quality Student Practice
    Percentage                                          3.50      3.49       --
    Effect Size                                         0.21      0.19       --

Panel 3: Student Engagement with Print
  Percentage of Students Engaged with Print             8.18      8.15       --
    Effect Size                                         0.28      0.28       --


NOTES: 



The complete Reading First Impact Study (RFIS) sample includes 248 schools from 18 sites (17 school districts and 1 
state) located in 13 states. 125 schools are Reading First schools and 123 are non-Reading First schools. For grade 2, in 
2006 one non-RF school could not be included in the analysis because test score data were not available. For grade 3 in 
2007, one RF school could not be included in the analysis because test score data were not available. 

Minimal detectable effects are based on the standard errors of the impact estimates for the full sample pooled across three 
school years (except for Student Engagement with Print, which is based on two years of data) divided by the actual 
standard deviation of the outcome for the non-Reading First Schools pooled across the spring 2005, fall 2005, and spring 
2006 data. 

Impact estimates are statistically adjusted (e.g., take each school's rating, site-specific funding cut-point, and other 
covariates into account) to reflect the regression discontinuity design of the study. 

EXHIBIT READS: The minimal detectable effect of the Reading First program on reading comprehension for a 
mean scaled score in grade 1 is 7.62 scaled score points. The minimal detectable effect of the Reading First 
program on reading comprehension for a mean scaled score in grade 2 is 6.41 scaled score points. The minimal 
detectable effect of the Reading First program on reading comprehension for a mean scaled score in grade 3 is 
5.93 scaled score points. 

SOURCES: RFIS SAT 10 administration in the spring of 2005, 2006, and 2007 as well as from state/district education 
agencies in those sites that already use the SAT 10 for their standardized testing (i.e., FL, KS, MD, OR); RFIS TOSWRF 
administration in spring 2007; RFIS Instructional Practice in Reading Inventory, spring 2005, fall 2005, spring 2006, 
fall 2006, and spring 2007; and RFIS Student Time-on-Task and Engagement with Print, fall 2005, spring 2006, fall 2006, 
and spring 2007. 











The three panels in the exhibit present minimum detectable effects for the Tier 1 outcomes of the present 
study. The three columns in the exhibit present minimum detectable effects for grades one, two, and three 
separately. 

The top panel focuses on measures of student reading comprehension and decoding. Findings in this 
panel indicate that the present study design and impact estimation model have minimum detectable effects 
for reading comprehension that range from approximately 6 to 8 scaled score points, which corresponds 
to 0.15 to 0.16 standard deviations or about 7 percentage points. The minimal detectable effect for 
decoding was approximately 3 scaled score points, which corresponds to 0.21 standard deviations. These 
findings indicate that the present study achieved its goal of providing minimum detectable effect sizes 
that are no larger than 0.20 standard deviations for estimates of the impacts of Reading First on student 
reading comprehension. 35 

Findings in the second panel of the exhibit indicate that the minimum detectable effect for instructional 
time spent in the five dimensions of reading instruction in grades 1 and 2 is approximately 7 minutes, 
which corresponds to 0.32 to 0.33 standard deviations. 

Minimum detectable effects for the percentage of instructional intervals in the five dimensions that 
exhibited highly explicit instruction or that exhibited high quality student practice ranged from 
approximately 3 to 4 percentage points. The minimum detectable effect on the percentage of students 
engaged with print was between 10 and 11 percentage points, roughly twice as large as that for the 
preceding two measures. 

On balance, the statistical precision of the present study design and its analytic framework achieve the 
initial goals of the study’s design. The precision is adequate for full-sample impact estimates, which are 
the primary focus of the present study. 

Part 5: Handling Missing Data 

This section describes how the study handled missing data for each outcome measure and for covariates 
used in analyses. 

Surveys 

The study imputed values for several survey variables. When Reading Coach survey data were missing at 
the item-level, and the identical questions had been asked of school principals, principal data were used 
when available. The study used principals’ responses for 13 schools (all non-Reading First) without 
Reading Coach surveys on the “availability of differentiated instructional materials for struggling readers” 
outcome. The study also imputed values of 0 for the “amount of time dedicated to serving as K-3 reading 
coach” for six (non-Reading First) schools that the study had confirmed did not have reading coaches. 

The study imputed data from other RF schools about the presence of a scheduled reading block in grades 
K-3 and about the length of the K-3 reading block when all other RF schools in the site had reported 
having a reading block and that the block was the same length (n=2). In one instance, where all other RF 



35 See Gamse et al. (2004). 








schools in a site reported using the same core reading program, the study imputed the core reading 
program for that school using data from other RF schools in that site. 

Missing data rates ranged from 0.1 to 3.3 percent for teacher survey outcomes (RF: 0.1 to 1.0 percent; 
non-RF: 0.0 to 4.9 percent) and 1.3 to 2.8 percent for reading coach and/or principal survey outcomes 
(RF: 0.0 to 1.6 percent; non-RF: 2.7 to 4.1 percent). Survey constructs (i.e., those outcomes comprised of 
multiple survey items) were constructed only for observations with complete data, with one qualification: 
the construct “minutes spent on reading instruction per day” was calculated as the total number of minutes 
reported for the previous week (across a maximum of 5 days) divided by the number of days with non- 
missing values. For this construct, teacher surveys with missing data for every day of the previous week 
were eliminated (0.9 percent). 
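
The "minutes spent on reading instruction per day" rule can be expressed in a few lines. This is a sketch under the assumption that each teacher's report is a list of up to five daily values with `None` for a missing day; the function name is hypothetical.

```python
def minutes_per_day(daily_minutes):
    """Total reported minutes divided by the number of days with a value.

    Returns None when every day is missing, in which case the survey
    would be eliminated for this construct.
    """
    reported = [m for m in daily_minutes if m is not None]
    if not reported:
        return None
    return sum(reported) / len(reported)


# Hypothetical teacher reports for the previous week (Monday to Friday).
partial_week = minutes_per_day([90, 90, None, 120, None])
empty_week = minutes_per_day([None, None, None, None, None])
```

Dividing by the number of non-missing days, rather than by five, avoids understating daily minutes for teachers who skipped a day on the form.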

Classroom Observations: IPRI 



No imputations were made for observations that were missing in their entirety. Imputations were made for 
a small proportion of intervals within observations during data cleaning. When an individual observation 
record (comprised of 35 successive intervals, on average) contained gaps in time or blank intervals, the 
record was filled in to make observation interval times internally consistent, using verbatim information 
provided by the observer on classroom activities. When no such information was available, gaps in time 
were filled in with intervals coded as non-instructional. When an observation contained more than one 
blank interval, and the same instructional activity (at the dimension or Part B levels) had been recorded in 
the intervals immediately before and after the blank interval, the blank intervals were post-coded to be the 
same as the preceding and successive intervals’ activities. If the surrounding intervals were not identical 
to each other, the blank intervals remained blank. In the pooled sample (spring 2005, fall 2005, spring 
2006, fall 2006, spring 2007), imputations were made in 402 classroom observations (2.8 percent). Of 
those 402 observations, 205 were in Reading First classrooms and 197 were in non-Reading First 
classrooms, which corresponds to 2.9 percent of all Reading First classrooms and 2.8 percent of 
non-Reading First classrooms. 
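
The post-coding rule for blank intervals can be illustrated with a short sketch. It assumes an observation is a list of activity codes with `None` for a blank interval, and it treats any run of one or more consecutive blanks the same way; both are assumptions made for illustration. The separate step of filling gaps in time with non-instructional intervals is not shown.

```python
def post_code_blanks(intervals):
    """Fill runs of blank intervals when the codes on both sides match."""
    coded = list(intervals)
    n = len(coded)
    i = 0
    while i < n:
        if coded[i] is None:
            start = i
            # Advance to the end of this run of blanks.
            while i < n and coded[i] is None:
                i += 1
            before = coded[start - 1] if start > 0 else None
            after = coded[i] if i < n else None
            # Fill only when the surrounding intervals are identical.
            if before is not None and before == after:
                coded[start:i] = [before] * (i - start)
        else:
            i += 1
    return coded


# Hypothetical observation record with activity codes per interval.
record = ["phonics", None, None, "phonics", None, "fluency"]
cleaned = post_code_blanks(record)
```

In this example the first two blanks are filled because both neighbors are coded as phonics, while the last blank stays blank because its neighbors differ.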



Classroom Observations: STEP 



No data imputation was done for the STEP. Missing data were handled in the following manner. If 
classrooms had missing values for all three sweeps, they were dropped from the analytic sample. A sweep 
was coded as missing if the class was in transition between activities or the entire class was listening to a 
story. Only 78 classrooms (1 percent) were given a missing data code for all three sweeps. Of the 78 
classrooms, 23 were Reading First and 55 were non-Reading First, which corresponds to 1 percent of 
Reading First classrooms and 2 percent of non-Reading First classrooms in the pooled analytic sample. In 
cases where classrooms had one or more non-missing values for the three sweeps, the percentage of 
students engaged with print was calculated by averaging across the number of sweeps available for that 
classroom. For the pooled analytic dataset (fall 2005, spring 2006, fall 2006, and spring 2007), 70 percent 
of classrooms had three sweeps of data; 23 percent had two sweeps of data; 5 percent had one sweep of 
data; and 1 percent were missing all three sweeps. 36 
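
A sketch of the STEP calculation follows. It assumes each sweep is recorded as a pair of counts (students engaged with print, students present) or `None` when the sweep was coded as missing, and it averages the per-sweep percentages; whether the study averaged percentages or pooled counts is not stated, so that choice is an assumption.

```python
from collections import Counter


def classroom_engagement(sweeps):
    """Return (percent engaged, number of valid sweeps) for one classroom."""
    valid = [
        100.0 * engaged / present
        for engaged, present in (s for s in sweeps if s is not None)
    ]
    if not valid:
        return None, 0  # classroom would be dropped from the analytic sample
    return sum(valid) / len(valid), len(valid)


# Hypothetical classrooms with three sweeps each.
classrooms = {
    "A": [(10, 20), (15, 20), None],
    "B": [None, None, None],
}
results = {name: classroom_engagement(s) for name, s in classrooms.items()}
sweep_counts = Counter(n for _, n in results.values())
```

Tabulating `sweep_counts` across classrooms gives the kind of breakdown reported above (three, two, one, or no sweeps of data).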



36 Numbers do not add to 100% due to rounding. 








Student Reading Achievement: SAT 10 Reading Comprehension Subtest 



No data imputation was done for the SAT 10. Students' test scores were coded as missing and excluded 
from the analytic sample if they were deemed invalid according to SAT 10 scoring guidelines. For the 
pooled sample (Spring 2005, Spring 2006, Spring 2007), this amounted to 222 student test scores (0.2 
percent). Of the missing scores, 92 were Reading First and 130 were non-Reading First, which 
corresponds to 0.1 percent of Reading First test scores and 0.2 percent of non-Reading First scores. 

Data were imputed for three covariates used in student achievement analyses — student age, gender, and 
date of testing. If missing, student age and gender were imputed by using school-by-grade means, except 
when data on an entire grade within a school was missing, in which case the district-by-grade mean was 
used. When an imputed covariate was used in analysis, a dummy variable indicating the imputed 
observations was also included. For student age, values were imputed for 0.35 percent (0.19 percent 
Reading First and 0.53 percent non-Reading First) of the pooled sample. For gender, values were imputed 
for 0.36 percent (0.30 percent Reading First and 0.43 percent non-Reading First) of the pooled sample. 
For the date of testing covariate, which was recorded at the classroom level, missing values were imputed 
using the school mean. Values were imputed for 0.79 percent (1.44 percent Reading First and 0.06 
percent non-Reading First) of the pooled sample. 
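
The covariate imputation described above, with its fallback from the school-by-grade mean to the district-by-grade mean and its imputation dummy, might be sketched as follows for student age. The record layout and field names are hypothetical, and school identifiers are assumed to be unique across districts.

```python
from collections import defaultdict
from statistics import mean


def impute_age(students):
    """Fill missing ages and add an 'age_imputed' indicator (1 = imputed)."""
    by_school_grade = defaultdict(list)
    by_district_grade = defaultdict(list)
    for s in students:
        if s["age"] is not None:
            by_school_grade[(s["school"], s["grade"])].append(s["age"])
            by_district_grade[(s["district"], s["grade"])].append(s["age"])
    filled = []
    for s in students:
        row = dict(s, age_imputed=0)
        if s["age"] is None:
            # Prefer the school-by-grade mean; fall back to district-by-grade
            # when the entire grade within the school is missing.
            pool = (by_school_grade.get((s["school"], s["grade"]))
                    or by_district_grade.get((s["district"], s["grade"])))
            if pool:
                row["age"] = mean(pool)
                row["age_imputed"] = 1
        filled.append(row)
    return filled


# Hypothetical records for one district and grade.
students = [
    {"district": "D1", "school": "S1", "grade": 1, "age": 6.5},
    {"district": "D1", "school": "S1", "grade": 1, "age": 7.5},
    {"district": "D1", "school": "S1", "grade": 1, "age": None},
    {"district": "D1", "school": "S2", "grade": 1, "age": None},
    {"district": "D1", "school": "S3", "grade": 1, "age": 8.0},
]
filled = impute_age(students)
```

The same pattern would apply to gender coded as a 0/1 indicator, and the `age_imputed` flag corresponds to the dummy variable included in the analysis whenever an imputed covariate is used.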

Student Reading Achievement: TOSWRF 

No data imputation was done for the TOSWRF. Students were coded as missing and were excluded from 
the analytic sample if birth dates were missing, or out of range (meaning that a standard score could not 
be computed for those students), or if students did not follow test instructions. Of the 465 test scores (5 
percent) excluded from the analytic sample, 233 were Reading First and 232 were non-Reading First, 
which corresponds to 4 percent of Reading First scores and 4 percent of non-Reading First scores in the 
analytic sample. 

Covariates used in the TOSWRF analyses were imputed in the same manner as described above for the 
SAT 10 analyses. 








Appendix C: Measures 



Appendix C, Parts 1 through 5, describe the data collection instruments and assessments used to create 
measures for all outcome domains assessed in the RFIS. These instruments and assessments include 
Reading Coach and Teacher Surveys, the Instructional Practice in Reading Inventory (IPRI), the Global 
Appraisal of Teaching Strategy (GATS), Student Time-on-Task and Engagement with Print (STEP), and 
assessments of students’ reading performance (SAT 10 and TOSWRF). Parts 1 through 5 also include 
relevant information on properties of instruments, data collection procedures, and response rates. 

Appendix C, Part 6 contains copies of each of the RFIS data collection instruments. 

Part 1: Reading Coach and Teacher Surveys 

Description of the Instruments 

The RFIS developed survey instruments for reading coaches and classroom teachers in grades 1 through 3 
to learn about how schools were implementing scientifically based reading programs. The surveys for the 
RFIS and the Reading First Implementation Evaluation 1 contain identical items in order to facilitate 
comparisons between the purposive RDD sample of the current study and the nationally representative 
sample of Reading First schools and personnel surveyed by the Implementation Evaluation. 

The Reading Coach Survey targeted school level individuals designated as reading coaches, reading 
specialists, literacy coordinators or Title I/resource teachers. It included items on the coach’s background 
and experience, core and supplemental reading materials, professional development offered to grades 1-3 
teachers, specific coaching activities, characteristics of reading instruction in the school, changes that 
have taken place in reading instruction, and areas needing improvement. See Appendix C, Part 7, 
Exhibit C.17 for a copy of the Reading Coach Survey. 

The Teacher Survey addressed student characteristics, reading instruction (e.g., materials, content, time 
allocation), assessment, interventions for struggling readers, participation in reading-related professional 
development, and collaboration and support from other teachers and staff. The surveys provide 
self-reported information on the instructional emphases across grades. See Appendix C, Part 7, Exhibit C.18 
for a copy of the Grade 2-3 Teacher Survey. 

Administration Procedures and Response Rates 

Surveys were mailed in March 2007 to building level study liaisons who then distributed sealed envelopes 
to an average of nine classroom teachers (three per grade level), the school's reading coach, and the 
building principal. Respondents were asked to return the completed surveys within two to three weeks. 
Follow-up was conducted to encourage potential respondents to complete and return the surveys. 



1 The Reading First Implementation Evaluation, commissioned by the Policy and Program Studies Service at the U.S. 
Department of Education, collected survey data from principals, teachers and reading coaches in nationally representative 
samples of RF schools and non-RF Title I schoolwide project (SWP) schools in the 2004-05 and 2006-07 school years. 



Final Report: Measures 



C-1 





Exhibit C.1 provides the response rate and sample size for each survey. In 2007, 227 reading coaches and 
1,792 teachers completed a survey. The effective Reading Coach Survey response rate was 99 percent for 
RF schools and 89 percent for non-RF schools. The effective Teacher Survey response rate was 87 
percent for RF schools and 83 percent for non-RF schools. 
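
The spring 2007 rates quoted here are consistent with dividing respondents by the eligible pool shown in Exhibit C.1: schools minus reading coaches who did not meet criteria, and eligible teachers selected. That reading of "effective" is an inference from the exhibit's counts rather than a definition stated in the text, and the function name below is hypothetical.

```python
def effective_response_rate(respondents, sampled, ineligible=0):
    """Respondents as a percentage of the eligible pool."""
    return 100.0 * respondents / (sampled - ineligible)


# Spring 2007 counts taken from Exhibit C.1.
coach_rf = effective_response_rate(123, 125, ineligible=1)
coach_non_rf = effective_response_rate(105, 123, ineligible=5)
teacher_rf = effective_response_rate(927, 1066)
teacher_non_rf = effective_response_rate(865, 1041)
```

Rounded to whole percentages, these reproduce the 99, 89, 87, and 83 percent figures reported above.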



Exhibit C.1: Survey Data Collection: School, Reading Coach, and Teacher Sample Information 

Reading Coach Survey 

RFIS Schools 
  RF: 125     Non-RF: 123     Total: 248 

Reading Coaches Who Did Not Meet Criteria 1 
                Spring 2005     Spring 2007 
  RF:                1               1 
  Non-RF:           14               5 
  Total:            15               6 

Reading Coaches Who Responded to Survey 
                Spring 2005     Spring 2007 
  RF:           118 (95%)       123 (99%) 
  Non-RF:        79 (72%)       105 (89%) 
  Total:        197 (85%)       227 (94%) 

Analytic Sample 2 
                Spring 2005     Spring 2007 
  RF:           118 (100%)      123 (100%) 
  Non-RF:        79 (100%)      105 (100%) 
  Total:        197 (100%)      228 (100%) 


1 Reading coach respondents who did not meet criteria included individuals who indicated that they did not serve any school(s) as 
a reading coach who provided ongoing training and support to school staff in delivering effective reading instruction. 

2 All completed surveys were used in the analytic sample. Information on item-level response rates is presented on tables where 
applicable, and Appendix B, Part 5 describes the overall approach to handling missing survey data. 

SOURCE: Reading First Impact Study Reading Coach Surveys, spring 2005 and spring 2007. 






Exhibit C.1: Survey Data Collection: School, Reading Coach, and Teacher Sample Information 
(continued) 

Teacher Survey 

RFIS Schools 
  RF: 125     Non-RF: 123     Total: 248 

Teachers Who Did Not Meet Criteria 1 
                Spring 2005     Spring 2007 
  RF:               16             123 
  Non-RF:           21              81 
  Total:            37             204 

Teachers Who Met Criteria 
                Spring 2005     Spring 2007 
  RF:             1,507           1,388 
  Non-RF:         1,468           1,358 
  Total:          2,975           2,746 

Eligible Teachers Not Selected 
                Spring 2005     Spring 2007 
  RF:                0             322 
  Non-RF:            0             317 
  Total:             0             639 

Eligible Teachers Selected 
                Spring 2005     Spring 2007 
  RF:             1,507           1,066 
  Non-RF:         1,468           1,041 
  Total:          2,975           2,107 

Teachers Who Responded to Survey 2 
                Spring 2005     Spring 2007 
  RF:           1,076 (71%)     927 (87%) 
  Non-RF:         961 (65%)     865 (83%) 
  Total:        2,037 (69%)   1,792 (85%) 

Analytic Sample 3 
                Spring 2005     Spring 2007 
  RF:           1,344 (100%)  1,400 (100%) 
  Non-RF:       1,315 (100%)  1,342 (100%) 
  Total:        2,659 (100%)  2,742 (100%) 




NOTES: 

1 Respondents who did not meet criteria or were not selected included student teachers, substitute teachers, or teachers whose 
classrooms were not observed (for grades 1 and 2) or tested (grades 1, 2, and 3). 

2 A total of 23 teachers (15 in 2004-05, 8 in 2006-07) returned surveys but were dropped because they indicated that they did not 
teach reading or grades 1, 2, or 3. 

3 All completed surveys were used in the analytic sample. Information on item-level response rates is presented on tables where 
applicable, and Appendix B, Part 5 describes the overall approach to handling missing survey data. 

SOURCE: Reading First Impact Study Teacher Surveys, spring 2005 and spring 2007. 






Composition, Scale, Internal Consistency and Scientifically Based Research 
Support 



Exhibit C.2 reports the composition, metric, specifications, and internal consistency of the survey 
outcomes. Exhibit C.3 includes information on the Reading First legislation and guidance that supports 
the survey outcomes. 

Part 2: Classroom Instruction: The Instructional Practice in Reading 
Inventory (IPRI) 

Background 

To measure the impact of Reading First on classroom instruction, the RFIS team conducted classroom 
observations in both Reading First and non-Reading First (non-RF) classrooms. The primary instrument 
used to evaluate instruction was the Instructional Practice in Reading Inventory (IPRI). The RFIS Team 
was unable to identify an existing observational instrument that fulfilled all of the study requirements; 
consequently, the RFIS Team developed the IPRI specifically for the RFIS. The IPRI is designed to 
measure first- and second-grade teachers’ use of instructional behaviors informed by scientifically based 
reading research (SBRR), as described in the National Research Council’s report (Snow, Burns, and 
Griffin, 1998) and the National Reading Panel report (National Institute of Child Health and Human 
Development, 2000). In particular, the IPRI focuses on instruction in the five dimensions of reading 
instruction emphasized by SBRR (phonemic awareness, decoding/phonics, fluency, vocabulary, and 
comprehension). Exhibit C.4 gives specific examples of instructional activities associated with each of the 
five dimensions. 

The development of the IPRI relied on several sources, including (1) research on the components of 
effective elementary grade reading instruction (e.g., Kamil, 2004; National Institute of Child Health and 
Human Development, 2000; Snow, Burns, and Griffin, 1998; Stahl, 2004); (2) reviews of existing 
instruments (among the instruments reviewed were the following: The Instructional Content Emphasis 
(ICE) [Edmonds and Briggs, 2003]; Foorman and Schatschneider direct observation system and 
instruments from the Center for Academic and Reading Skills (CARS) [Foorman and Schatschneider, 
2003]; English Language Learner Classroom Observation Instrument (ELLCOI) [Haager et al., 2003]; 
Teachers’ Instructional Practice (TIP) [Carlisle and Scott, 2003]; Utah’s Profile of Scientifically based 
Reading Research [Dole et al., 2001]; The Classroom Observation Record [Abt Associates and RMC 
Research, 2002]; and Observation Measure of Language and Literacy Instruction (OMLIT), developed by 
Abt Associates as part of the Even Start Classroom Literacy Interventions and Outcomes (CLIO) Study 
[Goodson et al., 2004]); and (3) research on the development of classroom observation instruments 
(Vaughn and Briggs, 2003). 2 



2 For a comprehensive description of the development of the IPRI, see Dwyer et al., 2007. 






Exhibit C.2: Composition, Metric, Specifications, and Internal Consistency of Survey Outcomes 



Domain 

Survey Outcome 

Individual Items Comprising the Outcome (as applicable) 


Survey, 

ltem(s) 


Metric 


Outcome Specifications 1 


Internal 

Consistency 2 


Professional Development (PD) in SBRI 


1. Amount of PD in reading received by teachers 

Attended short, stand-alone training or workshop in reading 

(half-day or less) 

Attendance longer institute or workshop in reading (more 

than half-day) 

Attended a conference about reading (might include multiple 

short offerings) 


Teacher, 
Cl: a, b, d 


Hours 


Sum of hours across 3 items 


0.22 


2. Teacher receipt of PD in the five essential components of 
reading instruction 

In phonemic awareness 

In phonics 

In vocabulary 

In fluency 

In comprehension 


Teacher, 
C4: a-s 


0-5 scale 


Each component (e.g., fluency) was scored 
dichotomously (1=teacher received PD in at 
least one of the topics listed, 0= teacher did 
not receive PD in any of the topics listed). 
Sum of 5 dichotomously scored components 
(1=addressed, 0=not addressed). 


0.86 


3. Teacher receipt of coaching 


Teacher, 
C2: a 


Proportion 


Dichotomous variable (1=teacher received 
assistance from reading coach, 0=teacher 
did not receive assistance from reading 
coach or not available) 


N/A 


4. Amount of time dedicated to serving as K-3 reading coach 


Reading Coach, B3 


Percent 


N/A 


N/A 


Amount of Reading Instruction 


5. Minutes spent on reading instruction per day 

Last week, approximately how many minutes per day did you devote to reading instruction? Reported 
separately for each of the five weekdays. 


Teacher, B1 


Minutes 


Total number of minutes of reading 
instruction for last week divided by number 
of days with non-missing values. 


0.99 



Supports for Struggling Readers 



6. Availability of differentiated instructional materials for 
struggling readers 3 

Use separate program materials in interventions 

Use core reading program with supplemental materials 

Use core reading program only 

Use reading materials written in ELLs’ home language 

Use alternative materials designed for ELLs 


Reading Coach/Principal, 
E1: a-e 


Proportion 


Dichotomous variable (1=E1: a, b, d, or e 
are available for struggling readers, 0=only 
the core reading program is available for 
struggling readers, E1: c) 


N/A 















7. Provision of extra classroom practice for struggling 
readers (over the past month) 

Extra practice in the classroom with phonemic awareness 

Extra practice in the classroom with phonics 

Extra practice in the classroom with fluency 

Extra practice in the classroom with comprehension 


Teacher, 
B8: b-e 


0-4 scale 


Sum of 4 dichotomously scored items 
(1=received, 0=did not receive) 


0.77 



Use of Assessments 



8. Use of assessments to inform classroom practice 

Use test results to organize instructional groups 

Use tests to determine progress on skills 

Use diagnostic tests to identify students who need reading intervention services 


Teacher, 
B5: u, w, y 


0-3 scale 


Sum of 3 dichotomously scored items 
(1=central, 0=small or not part of reading 
instruction) 


0.60 



NOTES: 

1 Missing data rates ranged from 0.1 to 3.3 percent for teacher survey outcomes (RF: 0.1 to 1.0 percent; non-RF: 0.0 to 4.9 percent) and 1.3 to 2.8 percent for reading coach 
and/or principal survey outcomes (RF: 0 to 1.6 percent; non-RF: 2.7 to 4.1 percent). Survey constructs (i.e., those outcomes composed of more than one survey item) were 
computed only for observations with complete data, with one qualification: for the construct “minutes spent on reading instruction per day,” the mean was calculated as the total 
number of minutes reported for last week (over a maximum of 5 days) divided by the number of days with non-missing values. For this construct, only teacher surveys with missing 
data for all 5 days were treated as missing (0.9 percent). 

2 Internal consistency was calculated using Cronbach’s raw alpha for survey outcomes other than single dichotomous outcome variables. 



3 This survey item was asked of both reading coaches and principals. In cases where reading coach survey data were not available, the study used principal responses for those 
schools (n=13). 



EXHIBIT READS: The outcome variable “amount of PD in reading received by teachers” consisted of three individual items from the RFIS Teacher Survey, item C1. 
The sum of these three items represents the total number of hours of professional development in reading attended by teachers in the form of short 
trainings/workshops, longer institutes/workshops, and conferences. The internal consistency reliability was 0.22. 

Sources: Reading First Impact Study Reading Coach, Principal, and Teacher Surveys. 
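
The two computations described in the notes to Exhibit C.2 can be sketched as follows. This is a minimal illustration, not the study's actual code: the function names and input layouts are assumptions, and the alpha function uses the standard formula for Cronbach's raw (unstandardized) alpha on complete cases.

```python
from statistics import variance


def mean_minutes_per_day(daily_minutes):
    """Mean minutes of reading instruction per day.

    daily_minutes: up to 5 values (one per weekday), with None for a
    missing day. The total is divided by the number of non-missing days;
    the result is None only when every day is missing.
    """
    reported = [m for m in daily_minutes if m is not None]
    if not reported:
        return None
    return sum(reported) / len(reported)


def cronbach_raw_alpha(rows):
    """Cronbach's raw alpha for a multi-item construct.

    rows: one list of k item scores per respondent (complete cases only).
    Returns None when alpha is undefined (fewer than 2 respondents,
    fewer than 2 items, or zero variance in the total score).
    """
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        return None
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

For example, a teacher who reported 90, 120, and 90 minutes on three days and left two days blank would receive a value of 100 minutes per day under this rule.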












Exhibit C.3: Reading First Legislative Support and Guidance for Survey Outcomes 



Domain 

Survey Outcome 

Individual Items Comprising the Outcome (as applicable) 


Survey, Item(s) 


Reading First Legislation 1 or Guidance 2 


Professional Development (PD) in SBRI 


1. Amount of PD in reading received by teachers 

Attended short, stand-alone training or workshop in reading (half-day or less) 

Attended longer institute or workshop in reading (more than half-day) 

Attended a conference about reading (might include multiple short offerings) 


Teacher, 
C1: a, b, d 


Legislation, Section 1202(c)(7)(A)(iv) 
Guidance, p. 7, C-3 


2. Teacher receipt of PD in the five essential components of reading instruction 

In phonemic awareness 

In phonics 

In vocabulary 

In fluency 

In comprehension 


Teacher, 
C4: a-s 


Legislation, Section 1202(c)(7)(A)(iv)(I) 
Guidance, p. 7, C-3 


3. Teacher receipt of coaching 


Teacher, C2: a 


Legislation, Section 1202(c)(7)(A)(iv)(III) 
Guidance, p. 7, C-3 


4. Amount of time dedicated to serving as K-3 reading coach 


Reading Coach, 
B3 


Legislation, Section 1202(c)(7)(A)(iv)(III) 
Guidance, p. 7, C-1, C-3 


Amount of Reading Instruction 


5. Minutes spent on reading instruction per day 


Teacher, B1 


Guidance, p. 7, C-1 


Supports for Struggling Readers 


6. Availability of differentiated instructional materials for struggling readers 

Use separate program materials in interventions 

Use core reading program with supplemental materials 

Use core reading program only 

Use reading materials written in ELLs’ home language 

Use alternative materials designed for ELLs 


Reading Coach/ 
Principal, E1: a-e 


Guidance, p. 7, C-2 


7. Provision of extra classroom practice for struggling readers 

Extra practice in the classroom with phonemic awareness 

Extra practice in the classroom with phonics 

Extra practice in the classroom with fluency 

Extra practice in the classroom with comprehension 


Teacher, B8: b-e 


Legislation, Section 1202(c)(7)(A)(ii)(II)(ee) 












Use of Assessments 


8. Use of assessments to inform classroom practice 

Reading assessments are used to screen students for reading difficulties 

Diagnostic assessments are used to identify strengths and weaknesses of struggling readers 

Reading assessments are used to monitor student progress 


Teacher, B5: u, w, y 


Legislation, Section 1202(c)(7)(A)(i) 
Guidance, p. 7, C-1, C-2, C-4 



NOTES: 

1 The legislation for Reading First is contained in the No Child Left Behind Act of 2001, ESEA, 2001, Title I, Part B, Subpart 1. 

2 Guidance for the Reading First Program is provided by the U.S. Department of Education (2002). 

EXHIBIT READS: The outcome variable “amount of PD in reading received by teachers,” which consists of three individual items from the RFIS Teacher Survey 
(Question C1), was supported by both the Reading First legislation [No Child Left Behind Act of 2001, ESEA, 2001, Title I, Part B, Subpart 1, Section 1202(c)(7)(A)(iv)] and 
the Guidance for the Reading First program (U.S. Department of Education, 2002, p. 7, C-3). 

Sources: Reading First Impact Study Reading Coach, Principal, and Teacher Surveys. 








Exhibit C.4: Examples of Instruction in the Five Dimensions of Reading Instruction 



Phonemic awareness 


The teacher is working with a group of four students. The teacher says, “Listen to me. The word is 
hat. If I take away the /h/ sound at the beginning, I have the word at. Then if I add a /b/ sound to the 
beginning I get bat. Now you try. The word is sat. If we take away the /s/ sound what word do we 
have?” [students respond orally]. “That’s right, at. Now add a /k/ sound to the beginning. What word? 
That’s right, cat.” 

The teacher is working with a pair of students. He asks students to identify the final sound in each of 
a list of 10 words. The students respond orally to each prompt from the teacher: “Crack. What’s the 
last sound in crack? [students respond orally]. Good. Ok: Take. What’s the last sound? [students 
respond orally]. Ok, next: kite. What’s the last sound? [students respond orally]. How about flight? 
[students respond orally]. That’s right, /t/, /t/ is the last sound in flight.” 


Phonics 


A group of 16 students has assembled in front of the classroom blackboard. The teacher writes the 
letters oi on the board and says, “Ok, now today we’re going to be learning about words that have o, 
i in them. When you see these vowels together, they make the /oy/ sound. Here’s an example.” The 
teacher writes a sentence on the board: I want Roger to join my club. She underlines the letters oi 
in the word join. “This word is join. ‘I want Roger to join my club.’ See that oi? What sound does oi 
make?” [students respond, some of them incorrectly]. “Ok, listen carefully. Not /eye/... no, oi makes 
the /oy/ sound. Everyone try that: /oy/!” [students in unison say /oy/]. “Ok, good, now what’s this word 
[she points to join]?” The students pronounce join correctly. “Excellent, ok, let’s try another one.” 
She writes the word coin on the board. “Boys and girls, look at that oi in the word. Sound out this 
word for me.” 

Six students are seated with a teacher. Each student has a set of individual magnetic letters and a 
metal tray. The teacher is asking students to form words that she dictates orally: “Ok, listen to the 
word, think about the sounds and what letters go with those sounds. Remember that we’ve been 
working with the /o/ sound and its spellings. We know that one way to spell that is with o, a. Try to 
make the word using your letters. The first word is goat. Use your letters to make the word goat.” 
Students assemble their letters and the teacher checks each student’s work. “Good. Everyone used 
o, a to spell goat. Ok, let’s try another word: float.” Students form the word with their letters. “Ok, 
good! You’re doing very well. Now, we also know another way to spell some words with the long /o/ 
sound. Remember the silent e rule? It makes the vowel say its name. So, to spell the word tote, 
Arthur, tell me how we’d write tote?” 


Vocabulary 


The teacher gives a definition for the word swift and uses it in a sentence: “Swiftly? Something that 
is swift is moving very fast, rapidly. So, remember when we learned about how fast cheetahs can run 
over land? Well, we might say, ‘the cheetah ran swiftly across the ground, quickly catching up to the 
tiger.’” 

As they are reading a story in class, students come across the word debating, and the teacher 
discovers that they do not know what it means. The teacher defines debating by contrasting it with 
more familiar words (chatting and talking). The teacher says, “When two people are debating 
something, it means that they are talking about the reasons to do something and the reasons not to 
do something, so in our story, John and Sara are debating whether or not to go on a picnic. On the 
one hand, the weather is nice, but on the other hand they are thinking there may be a lot of ants. So 
they’re debating what to do. Chatting is different than debating. When you’re chatting with 
someone, you’re usually not trying to decide something, you’re just talking about things that aren’t 
too serious. You chat more to enjoy the talking, not really to decide something together.” 








Fluency 


Roberto is reading orally from a passage about parrots and their habitat. When he reaches the end of 
the second paragraph, the teacher asks Roberto to read that same passage aloud again. When 
Roberto finishes, the teacher asks him to read the passage out loud a third time. 

The teacher assigns four students to pairs and distributes a page-long excerpt from a story they have 
been reading in class that week. Each pair of students also has a one-minute timer. “Ok, now you 
each have a partner, and I want you to time your partner reading this passage out loud. Readers, you 
try to read as far as you can in one minute. Timers, you keep track of the time and tell your partner to 
stop reading when time runs out. Then circle the last word the reader got to in the passage.” 


Comprehension 


A teacher pauses in the middle of a story about Shackleton’s Antarctic Voyage to ask students to 
reflect on what they have just read and draw some inferences about how one character might be 
feeling. “What do you think the captain is feeling? Let’s see. The story doesn’t tell us exactly, but the 
story says the ship is starting to break apart. I’d certainly be very worried for myself and my crew if 
my ship were breaking apart! I bet the captain is really worried. Let’s see... the story also says the 
captain ‘furrowed his brow.’ That means he made his forehead wrinkle or sort of frown. Some people 
do that when they’re worried. That could be a sign that the captain is worried. He certainly has 
reason to be worried.” 

The teacher introduces a comprehension strategy. “One thing you should always do when you read 
is constantly ask yourself questions about the story. Asking yourself questions is a strategy to help 
make sure you understand what you just read. Asking questions also helps you think about what 
might happen next. We’re going to practice using this strategy. At the end of every paragraph today, 
we’re going to come up with some questions and write them up here on the board. Some questions 
we’ll be able to answer right away. But we might have other questions, too, and we’ll need to read 
more of the story before we can find out how to answer those questions.” 



Overview of the IPRI 

The IPRI observation instrument is a booklet containing a series of individual IPRI forms, each of which 
corresponds to a three-minute observation interval. 3 Observation data for a given reading block are 
collected via sequentially-ordered IPRI forms that span the entire observation period (e.g., a 60-minute 
observation would be recorded on 20 sequential forms, one for each successive three-minute interval). 
During each three-minute interval, observers record any of the teacher’s instructional behaviors listed on 
the IPRI that occur during that interval. At the end of each three-minute interval (signaled by a 
pre-programmed vibrating wristwatch), observers turn to a new IPRI form and begin another three-minute 
interval, again recording the presence of targeted behaviors. 

Within a given three-minute interval, a particular behavior is coded only once, regardless of how often 
that behavior occurs within the interval. If the same behavior recurs in a later interval, it is coded once 
again in that interval. For example, if behavior x occurs in interval n, the observer circles the code for 
behavior x once during interval n. If behavior x occurs in the next interval, n+1, the observer again circles 
the code for behavior x during interval n+1. 
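
The interval-recording rule can be sketched as follows. This is an illustrative sketch only: the IPRI was completed by hand on paper forms, and the event format (seconds from the start of the observation paired with a behavior code) and the function name are assumptions made for the example.

```python
import math

INTERVAL_SECONDS = 3 * 60  # each IPRI form covers one three-minute interval


def code_intervals(events, observation_minutes):
    """Turn timestamped behaviors into one set of codes per IPRI form.

    events: iterable of (seconds_from_start, behavior_code) pairs
            (a hypothetical input format).
    Returns a list with one set per three-minute interval. Using a set
    means a behavior is recorded at most once per interval, however
    often it occurs within that interval.
    """
    n_forms = math.ceil(observation_minutes * 60 / INTERVAL_SECONDS)
    forms = [set() for _ in range(n_forms)]
    for seconds, code in events:
        index = int(seconds // INTERVAL_SECONDS)
        if 0 <= index < n_forms:
            forms[index].add(code)
    return forms


# Behavior "x" occurs twice in the first interval and once in the second.
example_forms = code_intervals([(10, "x"), (50, "x"), (200, "x"), (400, "y")], 60)
```

In this example a 60-minute observation yields 20 forms; "x" is recorded once on the first form and once on the second, and "y" once on the third.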



3 See Exhibit C.19 for a copy of the IPRI instrument. 











Structure of the IPRI Instrument 



Each IPRI form has four distinct parts: Part A, Part B, Part C, and Part D. Part A is divided into five 
color-coded sections that correspond to the five dimensions of reading instruction: phonemic awareness, 
decoding/phonics, fluency, vocabulary, and comprehension, respectively. Within each of these five 
sections are microcodes, specifically tailored to each of the five dimensions, which denote the following 
areas of interest: 

• the size of the student grouping to which instruction is delivered; 

• the use of any instructional support materials (e.g., manipulatives, pictures); 

• the teacher’s use of explicit instruction; 

• the teacher’s provision of practice opportunities for students; and 

• the teacher’s delivery of any corrective feedback or expansion of student responses. 

For example, within the phonemic awareness row, the IPRI microcodes for grouping are “whole class, 
large group, small group, pair, or individual”; for the use of various types of instructional supports, 
“teacher manipulative or kinesthetic, student manipulatives, kinesthetics”; and for corrective feedback, 
“teacher pinpoints what student(s) did incorrectly with sound(s) and gives correct response with or 
without students.” Explicit instruction and the provision of practice opportunities for students 
are often denoted by a combination of two or more microcodes. For 
example, if a teacher “demonstrates or models oral blending or segmenting with phonemes” in 
conjunction with “gives student(s) chance to practice oral blending or segmenting with phonemes,” it 
would be counted as explicit instruction. 
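
The idea of deriving an area of interest from a combination of microcodes can be sketched as below. The microcode identifiers and the rule table are hypothetical labels invented for this illustration; only the phonemic awareness combination quoted above comes from the report.

```python
# Hypothetical identifiers standing in for two phonemic awareness microcodes:
#   "demonstrates or models oral blending or segmenting with phonemes"
#   "gives student(s) chance to practice oral blending or segmenting with phonemes"
PA_MODELS = "pa_teacher_models_blending_segmenting"
PA_PRACTICE = "pa_students_practice_blending_segmenting"

# Each dimension maps to a list of microcode combinations; a combination
# counts as explicit instruction when all of its codes are circled on the
# same form. Only the one combination described in the text is listed.
EXPLICIT_INSTRUCTION_RULES = {
    "phonemic_awareness": [{PA_MODELS, PA_PRACTICE}],
}


def has_explicit_instruction(circled_codes, dimension):
    """True if any listed combination is fully present on the form."""
    combos = EXPLICIT_INSTRUCTION_RULES.get(dimension, [])
    return any(combo <= set(circled_codes) for combo in combos)
```

Under this sketch, a form with only the modeling code circled would not count as explicit instruction, while a form with both codes circled would.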

Part B of the IPRI contains codes to capture instruction or other activity outside the five dimensions, 
including: 

• Oral reading by students; 4 

• Oral reading by teacher alone (without student accompaniment); 

• Silent reading; 

• Spelling; 

• Written expression; 

• Other language arts; 

• Assessment; 

• Non-literacy instruction; 

• Non-instruction; 

• Academic management; 

• Transitions between activities; 

• Interruptions to instruction for the purpose of managing student behavior. 



4 Oral reading under Part B is marked when the teacher has not clearly indicated the instructional purpose of the oral reading. If, 
however, oral reading is used to advance instruction in one of the five targeted dimensions of reading instruction (e.g., 
comprehension), then the oral reading is coded within the corresponding row in Part A of the IPRI. 








Part C records teachers’ instructional errors that are not subsequently self-corrected. Part D records 
whether the teacher worked with a different small group of students than in any previous part of the 
observation. 5 



Training and Inter-rater Reliability of Classroom Observers 

Prior to each wave of data collection, field staff based in each of the RFIS sites attended a centralized, 
multi-day training on the IPRI and associated data collection protocols. The training curriculum included 
extensive practice coding a series of videotaped clips of real-time and unscripted classroom instruction 
that were filmed in RF and non-RF classrooms. The film clips were created specifically for the RFIS, and 
were edited to illustrate the codes included on the IPRI. Candidate observers conducted a live observation 
in a first or second grade classroom during the training session and received ongoing feedback, multiple 
opportunities for review, tutoring and other support throughout the training. 6 

One component of this training was that observers were required to pass two of three formal inter-rater 
reliability tests; each videotape used for reliability purposes was approximately 30 minutes in length. To 
calculate observers’ percent agreement with the master coding of each reliability tape, the RFIS Team 
used a procedure that reduces inflation in inter-rater reliability estimates due to chance agreement (see 
Kelly, 1977, cited in Suen and Ary, 1989). The inflation due to chance agreement is especially severe 
when some events (or codes) occur infrequently, as is the case with the IPRI. 7 To limit this inflation, 
agreement was calculated only for codes that occurred at least once in the reliability tape. In sum, if a 
behavior occurred at all during a 30-minute tape, observers were credited for correctly coding instances of 
the behavior and for correctly abstaining from coding it in intervals where it did not occur, and were 
penalized otherwise. Observers were not credited for abstaining from, nor penalized for, marking 
behaviors that never occurred throughout the entire reliability tape. 
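
The effect of restricting agreement to codes that occur in the master coding can be sketched as follows. This is an illustration of the general idea under simplifying assumptions (each tape is a list of per-interval sets of circled codes, and agreement is pooled over all code-by-interval cells); it is not the exact procedure from Kelly (1977) or the study's scoring program.

```python
def percent_agreement(master, observer, all_codes=None):
    """Percent agreement between an observer and the master coding.

    master, observer: lists of sets, one set of circled codes per
    three-minute interval (an assumed layout).
    all_codes: if given, agreement is computed over every listed code
    (the naive approach). If omitted, only codes that occur at least
    once in the master coding are scored, so codes that never occur
    earn no credit and no penalty.
    """
    if len(master) != len(observer):
        raise ValueError("master and observer must cover the same intervals")
    codes = set(all_codes) if all_codes is not None else set().union(*master)
    cells = len(codes) * len(master)
    if cells == 0:
        return None
    agreements = sum(
        (code in m) == (code in o)
        for m, o in zip(master, observer)
        for code in codes
    )
    return 100.0 * agreements / cells


# Toy tape with three intervals; "z" never occurs in the master coding.
master_tape = [{"a"}, {"a", "b"}, set()]
observer_tape = [{"a"}, {"b"}, {"z"}]
restricted = percent_agreement(master_tape, observer_tape)

# Naive version over 142 possible codes, most of which never occur.
every_code = ["a", "b", "z"] + [f"unused_{i}" for i in range(139)]
naive = percent_agreement(master_tape, observer_tape, all_codes=every_code)
```

In the toy tape the restricted figure is 5 of 6 cells, while the naive figure is 424 of 426 cells, because the many codes that neither coder ever marks count as agreement.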

For each potential observer, percent agreement with the master codes was calculated for each code 
individually; then agreement was aggregated across codes within the five sections in Part A and across 
codes within Part B. Finally, an aggregate overall percentage agreement across the five sections in Part A 
and codes within Part B was calculated. A report summarizing all of these measures of agreement (by 
individual code, by dimension, and overall) was prepared for each potential observer so that s/he (and the 
study team) could diagnose which codes had proven particularly troubling. Overall percent agreement 
was used to judge whether or not each observer had met the criterion for employment on the study. Only 
observers who successfully coded two of three videotaped reliability tests were hired. The mean overall 
percent agreement for observers was 88 percent (n=155 observers) in spring 2005 (for spring 2005 data 
collection), 90 percent (n=154 observers) in fall 2005 (for fall 2005 and spring 2006 data collection), and 
90 percent (n=130 observers) in fall 2006 (for fall 2006 and spring 2007 data collection). 



5 Minor changes were made to the IPRI after the spring 2005 data collection and prior to the fall 2005 wave of data collection; 
these changes included elaborating upon some micro-behaviors within each of the five dimensions. 

6 For a detailed description of the classroom observer training, see Dixon et al. (2007). 

7 During each observation interval, an IPRI form contains 142 possible codes; typically, only a small subset of the behaviors 
occur during a given interval. Thus, most of the possible codes are infrequent within a single interval. Including all 142 codes 
per interval in the calculation of percent agreement severely inflates inter-rater reliability. 








Data Collection 



Observations were conducted in first- and second-grade classrooms for two consecutive days during each 
classroom’s designated reading block. During the 2004-05 school year, the RFIS conducted two days of 
classroom observation in spring 2005. In the following study year (2005-06), a second round of 
observations was added, so that observers conducted observations for two consecutive days in the fall, 
and then again for two consecutive days in the spring. Again in 2006-07, observers conducted 
observations for two consecutive days in the fall, and then again for two consecutive days in the spring. 
The increased number of observations reflects a decision by the National Center for Education 
Evaluation/Institute of Education Sciences at the Department of Education to collect more data, both in 
terms of the number of observations and in terms of when during the year data could be collected. 

Observation scheduling was arranged by RFIS field supervisors via communication with each 
participating school’s study liaison. 8 Observers coded during the entire scheduled observation period, 
even when teachers appeared to be offering non-reading-related instruction. In those instances when 
reading instruction appeared to continue beyond the scheduled reading block, observers observed for up 
to an additional 30 minutes. Throughout observations, IPRI observers followed the actions and behaviors 
of classroom teachers. In classrooms with more than one adult present, observers determined beforehand 
who was the official teacher of record and which adult would be delivering that day’s reading instruction. 
The individuals responsible for delivering instruction were then followed for the observations whether or 
not they were the official teacher of record. Observations were rescheduled when the classroom teachers 
were absent or ill, although long-term substitutes replacing a teacher on an extended leave of absence 
(e.g., maternity, disability) were observed. 

The 248 schools in the RFIS study sample allowed for observations in 2,091 classrooms in 2004-2005, 
3,997 classrooms in 2005-2006, and 3,985 classrooms in 2006-2007. Of these, 1,917 classrooms met 
eligibility requirements for classroom observations in 2004-2005, 3,649 in 2005-2006, and 3,676 in 2006- 
2007. Classrooms were considered eligible to be in the study sample if they were not special education or 
English as a Second Language classes, if more than 75 percent of the students were in the target grades, 
and if the class was taught by the regular teacher or a long-term substitute. 
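
The eligibility rule can be written as a simple predicate. The record layout and field names below are hypothetical, chosen only to mirror the three criteria stated above.

```python
from dataclasses import dataclass


@dataclass
class Classroom:
    # Hypothetical fields mirroring the criteria in the text.
    is_special_education: bool
    is_esl: bool
    share_in_target_grades: float  # proportion of students in target grades
    regular_or_long_term_substitute: bool


def is_eligible(classroom):
    """Apply the three stated eligibility criteria."""
    return (
        not classroom.is_special_education
        and not classroom.is_esl
        and classroom.share_in_target_grades > 0.75
        and classroom.regular_or_long_term_substitute
    )
```

Note that the threshold is strict: a classroom with exactly 75 percent of its students in the target grades would not qualify under "more than 75 percent."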

Of the eligible classrooms, the RFIS selected a final observation sample of 1,639 classrooms in 2004- 
2005, 2,770 in 2005-2006, and 2,814 in 2006-2007. Classrooms were sampled within schools if, within 
each site as a whole, the number of classrooms exceeded an average of three classrooms per grade. Each 
classroom in the sample was expected to be observed two times during each of the three waves of data 
collection. The RFIS completed 96 percent of the expected classroom observations in 2004-2005, and 100 
percent in both 2005-2006 and 2006-2007. A flow chart of information on the RFIS IPRI sample and 
response rates is presented in Exhibit C.5. 
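
As a quick arithmetic check on the counts reported above, the share of classrooms that were eligible and the share of eligible classrooms that were sampled can be computed per school year. The dictionary layout and function name are assumptions for the example; the counts are those given in the text.

```python
# Classroom counts as reported in the text, keyed by school year.
CLASSROOM_COUNTS = {
    "2004-2005": {"all": 2091, "eligible": 1917, "sampled": 1639},
    "2005-2006": {"all": 3997, "eligible": 3649, "sampled": 2770},
    "2006-2007": {"all": 3985, "eligible": 3676, "sampled": 2814},
}


def sample_shares(counts):
    """Return eligible/all and sampled/eligible proportions for each year."""
    return {
        year: {
            "eligible_share": c["eligible"] / c["all"],
            "sampled_share": c["sampled"] / c["eligible"],
        }
        for year, c in counts.items()
    }


for year, shares in sample_shares(CLASSROOM_COUNTS).items():
    print(year, f"{shares['eligible_share']:.1%}", f"{shares['sampled_share']:.1%}")
```

By this calculation roughly 92 percent of classrooms were eligible in each year, and the sampled share of eligible classrooms was about 85 percent in 2004-2005 and about 76 to 77 percent in the two later years.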



8 In schools that did not have a designated “reading block,” the RFIS Team asked the school’s study liaison when observers 
would be able to see typical reading, literacy, and/or language arts instruction in classrooms. In cases where reading instruction 
was delivered in two discrete blocks interrupted by other instruction or activities (e.g., lunch, recess, math instruction), field 
staff observed both blocks. 








